Paper Title
Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition
Paper Authors
Paper Abstract
Training Transformer-based models demands a large amount of data, while obtaining aligned and labelled multimodal data is rather costly, especially for audio-visual speech recognition (AVSR). It therefore makes sense to exploit unlabelled unimodal data. On the other hand, although the effectiveness of large-scale self-supervised learning is well established in both the audio and visual modalities, how to integrate those pre-trained models into a multimodal scenario remains underexplored. In this work, we successfully leverage unimodal self-supervised learning to promote multimodal AVSR. In particular, audio and visual front-ends are trained on large-scale unimodal datasets; we then integrate components of both front-ends into a larger multimodal framework that learns to transcribe parallel audio-visual data into characters through a combination of CTC and seq2seq decoding. We show that the two components inherited from unimodal self-supervised learning cooperate well, so that the multimodal framework yields competitive results through fine-tuning. Our model is experimentally validated on both word-level and sentence-level tasks. Notably, even without an external language model, our proposed model raises the state-of-the-art performance on the widely accepted Lip Reading Sentences 2 (LRS2) dataset by a large margin, with a relative improvement of 30%.
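The pipeline described in the abstract (pre-trained unimodal front-ends, a fusion module, and joint CTC plus attention-based seq2seq decoding into characters) can be sketched as below. This is a minimal PyTorch illustration, not the authors' implementation: the placeholder linear front-ends, the module sizes, and the loss weight `ctc_weight` are assumptions for exposition; in the actual system the front-end components would come from large-scale self-supervised pre-trained models and be fine-tuned inside the multimodal framework.

```python
import torch
import torch.nn as nn


class HybridAVSR(nn.Module):
    """Fuses audio/visual front-end features and trains with a weighted sum of
    a CTC loss and a seq2seq (attention-decoder) cross-entropy loss."""

    def __init__(self, audio_dim=768, visual_dim=512, d_model=512,
                 vocab_size=40, ctc_weight=0.3, pad_id=0, blank_id=0):
        super().__init__()
        # Stand-ins for the unimodal front-ends; in the paper's setting these
        # would be components taken from self-supervised pre-trained models.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        # Fusion of the two (time-aligned) streams plus a shared encoder.
        self.fusion = nn.Linear(2 * d_model, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        # Branch 1: CTC head over the encoder output.
        self.ctc_head = nn.Linear(d_model, vocab_size)
        self.ctc_loss = nn.CTCLoss(blank=blank_id, zero_infinity=True)
        # Branch 2: autoregressive character decoder (seq2seq with attention).
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=pad_id)
        self.out = nn.Linear(d_model, vocab_size)
        self.ce_loss = nn.CrossEntropyLoss(ignore_index=pad_id)
        self.ctc_weight = ctc_weight

    def forward(self, audio_feats, visual_feats, tokens_in, tokens_out,
                enc_lengths, tgt_lengths):
        # audio_feats: (B, T, audio_dim); visual_feats: (B, T, visual_dim),
        # assumed to share the same frame rate. tokens_in/tokens_out are the
        # teacher-forced decoder inputs and shifted character targets; for
        # simplicity the same targets feed the CTC branch here.
        a = self.audio_proj(audio_feats)
        v = self.visual_proj(visual_feats)
        enc = self.encoder(self.fusion(torch.cat([a, v], dim=-1)))  # (B, T, d)

        # CTC branch: frame-level log-probs, shaped (T, B, V) for nn.CTCLoss.
        log_probs = self.ctc_head(enc).log_softmax(dim=-1).transpose(0, 1)
        loss_ctc = self.ctc_loss(log_probs, tokens_out, enc_lengths, tgt_lengths)

        # Seq2seq branch: causal decoder attending to the fused encoding.
        t = tokens_in.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf"),
                                       device=enc.device), diagonal=1)
        dec = self.decoder(self.embed(tokens_in), enc, tgt_mask=causal)
        loss_att = self.ce_loss(self.out(dec).flatten(0, 1), tokens_out.flatten())

        # Hybrid objective: weighted combination of CTC and attention losses.
        return self.ctc_weight * loss_ctc + (1 - self.ctc_weight) * loss_att
```

The weight `ctc_weight` trades the alignment-friendly CTC objective off against the attention decoder's implicit language modelling; its value here is an illustrative assumption, and the configuration actually used would need to be taken from the paper itself rather than the abstract.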