CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition

Paper Title

CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition

Authors

Dai, Wenliang; Cahyawijaya, Samuel; Yu, Tiezheng; Barezi, Elham J.; Xu, Peng; Yiu, Cheuk Tung Shadow; Frieske, Rita; Lovenia, Holy; Winata, Genta Indra; Chen, Qifeng; Ma, Xiaojuan; Shi, Bertram E.; Fung, Pascale

Abstract

With the rise of deep learning and intelligent vehicles, the smart assistant has become an essential in-car component that facilitates driving and provides extra functionality. In-car smart assistants should be able to process both general and car-related commands and perform the corresponding actions, which eases driving and improves safety. However, data scarcity for low-resource languages hinders the development of research and applications. In this paper, we introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR), for in-car command recognition in the Cantonese language with both video and audio data. It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers. Furthermore, we augment our dataset using common in-car background noises to simulate real environments, producing a dataset 10 times larger than the collected one. We provide detailed statistics of both the clean and the augmented versions of our dataset. Moreover, we implement two multimodal baselines to demonstrate the validity of CI-AVSR. Experimental results show that leveraging the visual signal improves the overall performance of the model. Although our best model achieves considerable quality on the clean test set, speech recognition quality on the noisy data is still inferior, and the noisy setting remains an extremely challenging task for real in-car speech recognition systems. The dataset and code will be released at https://github.com/HLTCHKUST/CI-AVSR.
