Paper Title

CSLNSpeech: solving extended speech separation problem with the help of Chinese sign language

Authors

Jiasong Wu, Xuan Li, Taotao Li, Fanman Meng, Youyong Kong, Guanyu Yang, Lotfi Senhadji, Huazhong Shu

Abstract

Previous audio-visual speech separation methods use the synchronization between the speaker's facial movements and speech in a video to supervise speech separation in a self-supervised way. In this paper, we propose a model that solves the speech separation problem assisted by both face and sign language, which we call the extended speech separation problem. We design a general deep learning network that learns a combination of three modalities, audio, face, and sign language, to better solve the speech separation problem. To train the model, we introduce a large-scale dataset named the Chinese Sign Language News Speech (CSLNSpeech) dataset, in which the three modalities of audio, face, and sign language coexist. Experimental results show that the proposed model has better performance and robustness than the usual audio-visual systems. In addition, the sign language modality can also be used alone to supervise the speech separation task, and introducing sign language helps hearing-impaired people learn and communicate. Finally, our model is a general speech separation framework that achieves very competitive separation performance on two open-source audio-visual datasets. The code is available at https://github.com/iveveive/SLNSpeech.
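The abstract describes a three-stream network that fuses audio, face, and sign language cues, but gives no implementation detail. Below is a minimal PyTorch sketch of one plausible fusion design, assuming per-frame face and sign-language embeddings from a pretrained visual backbone that are already time-aligned with the audio spectrogram frames; the class name `TriModalSeparator`, all layer sizes, and the mask-based separation head are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TriModalSeparator(nn.Module):
    """Illustrative three-stream fusion model (audio + face + sign language).

    All layer choices below are assumptions for exposition, not the
    architecture from the CSLNSpeech paper.
    """

    def __init__(self, n_freq=257, visual_dim=512, fused_dim=256):
        super().__init__()
        # Audio stream: encode the mixture's magnitude spectrogram.
        self.audio_enc = nn.Sequential(
            nn.Conv1d(n_freq, fused_dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Face and sign-language streams: project per-frame embeddings
        # (assumed to come from a pretrained visual encoder) into the
        # shared fusion dimension.
        self.face_enc = nn.Linear(visual_dim, fused_dim)
        self.sign_enc = nn.Linear(visual_dim, fused_dim)
        # Fuse the concatenated streams over time and predict a soft
        # time-frequency mask for the target speaker.
        self.fusion = nn.LSTM(3 * fused_dim, fused_dim,
                              batch_first=True, bidirectional=True)
        self.mask_head = nn.Sequential(nn.Linear(2 * fused_dim, n_freq),
                                       nn.Sigmoid())

    def forward(self, mix_spec, face_emb, sign_emb):
        # mix_spec: (B, n_freq, T); face_emb, sign_emb: (B, T, visual_dim),
        # assumed resampled to the same frame rate as the spectrogram.
        a = self.audio_enc(mix_spec).transpose(1, 2)      # (B, T, fused_dim)
        f = self.face_enc(face_emb)                       # (B, T, fused_dim)
        s = self.sign_enc(sign_emb)                       # (B, T, fused_dim)
        h, _ = self.fusion(torch.cat([a, f, s], dim=-1))  # (B, T, 2*fused_dim)
        mask = self.mask_head(h).transpose(1, 2)          # (B, n_freq, T)
        return mask * mix_spec                            # estimated target spectrogram

if __name__ == "__main__":
    model = TriModalSeparator()
    mix = torch.rand(2, 257, 100)    # two mixtures, 100 STFT frames
    face = torch.randn(2, 100, 512)  # per-frame face embeddings
    sign = torch.randn(2, 100, 512)  # per-frame sign-language embeddings
    est = model(mix, face, sign)
    print(est.shape)                 # torch.Size([2, 257, 100])
```

Masking the mixture spectrogram and resynthesizing with the mixture phase is a common separation recipe; the abstract's claim that sign language alone can supervise separation would correspond, in this sketch, to simply dropping the face stream from the fusion input.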
