Paper Title
Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition
Paper Authors
Paper Abstract
Training Transformer-based models demands a large amount of data, while obtaining aligned and labelled multimodal data is rather costly, especially for audio-visual speech recognition (AVSR). It therefore makes sense to exploit unlabelled unimodal data. On the other hand, although the effectiveness of large-scale self-supervised learning is well established in both the audio and visual modalities, how to integrate those pre-trained models into a multimodal scenario remains underexplored. In this work, we successfully leverage unimodal self-supervised learning to promote multimodal AVSR. In particular, audio and visual front-ends are trained on large-scale unimodal datasets; we then integrate components of both front-ends into a larger multimodal framework that learns to transcribe parallel audio-visual data into characters through a combination of CTC and seq2seq decoding. We show that the two components inherited from unimodal self-supervised learning cooperate well, so that the multimodal framework yields competitive results through fine-tuning. Our model is experimentally validated on both word-level and sentence-level tasks. Notably, even without an external language model, our proposed model raises the state-of-the-art performance on the widely accepted Lip Reading Sentences 2 (LRS2) dataset by a large margin, with a relative improvement of 30%.
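The pipeline described in the abstract (pre-trained unimodal front-ends, a fusion module, and joint CTC plus attention-based seq2seq decoding into characters) can be sketched as below. This is a minimal PyTorch illustration, not the authors' implementation: the placeholder linear front-ends, the module sizes, and the loss weight `ctc_weight` are assumptions for exposition; in the actual system the front-end components would come from large-scale self-supervised pre-trained models and be fine-tuned inside the multimodal framework.

```python
import torch
import torch.nn as nn


class HybridAVSR(nn.Module):
    """Fuses audio/visual front-end features and trains with a weighted sum of
    a CTC loss and a seq2seq (attention-decoder) cross-entropy loss."""

    def __init__(self, audio_dim=768, visual_dim=512, d_model=512,
                 vocab_size=40, ctc_weight=0.3, pad_id=0, blank_id=0):
        super().__init__()
        # Stand-ins for the unimodal front-ends; in the paper's setting these
        # would be components taken from self-supervised pre-trained models.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        # Fusion of the two (time-aligned) streams plus a shared encoder.
        self.fusion = nn.Linear(2 * d_model, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        # Branch 1: CTC head over the encoder output.
        self.ctc_head = nn.Linear(d_model, vocab_size)
        self.ctc_loss = nn.CTCLoss(blank=blank_id, zero_infinity=True)
        # Branch 2: autoregressive character decoder (seq2seq with attention).
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=pad_id)
        self.out = nn.Linear(d_model, vocab_size)
        self.ce_loss = nn.CrossEntropyLoss(ignore_index=pad_id)
        self.ctc_weight = ctc_weight

    def forward(self, audio_feats, visual_feats, tokens_in, tokens_out,
                enc_lengths, tgt_lengths):
        # audio_feats: (B, T, audio_dim); visual_feats: (B, T, visual_dim),
        # assumed to share the same frame rate. tokens_in/tokens_out are the
        # teacher-forced decoder inputs and shifted character targets; for
        # simplicity the same targets feed the CTC branch here.
        a = self.audio_proj(audio_feats)
        v = self.visual_proj(visual_feats)
        enc = self.encoder(self.fusion(torch.cat([a, v], dim=-1)))  # (B, T, d)

        # CTC branch: frame-level log-probs, shaped (T, B, V) for nn.CTCLoss.
        log_probs = self.ctc_head(enc).log_softmax(dim=-1).transpose(0, 1)
        loss_ctc = self.ctc_loss(log_probs, tokens_out, enc_lengths, tgt_lengths)

        # Seq2seq branch: causal decoder attending to the fused encoding.
        t = tokens_in.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf"),
                                       device=enc.device), diagonal=1)
        dec = self.decoder(self.embed(tokens_in), enc, tgt_mask=causal)
        loss_att = self.ce_loss(self.out(dec).flatten(0, 1), tokens_out.flatten())

        # Hybrid objective: weighted combination of CTC and attention losses.
        return self.ctc_weight * loss_ctc + (1 - self.ctc_weight) * loss_att
```

The weight `ctc_weight` trades the alignment-friendly CTC objective off against the attention decoder's implicit language modelling; its value here is an illustrative assumption, and the configuration actually used would need to be taken from the paper itself rather than the abstract.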