Paper Title
Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations
Paper Authors
Paper Abstract
Multimodal speech recognition aims to improve the performance of automatic speech recognition (ASR) systems by leveraging additional visual information that is usually associated with the audio input. While previous approaches make crucial use of strong visual representations, e.g. by finetuning pretrained image recognition networks, significantly less attention has been paid to their counterpart: the speech component. In this work, we investigate ways of improving the base speech recognition system by following techniques similar to the ones used for the visual encoder, namely transferring representations and data augmentation. First, we show that starting from a pretrained ASR model significantly improves state-of-the-art performance; remarkably, even when building upon a strong unimodal system, we still find gains by including the visual modality. Second, we employ speech data augmentation techniques to encourage the multimodal system to attend to the visual stimuli. This technique replaces the previously used word masking and comes with the benefits of being conceptually simpler and yielding consistent improvements in the multimodal setting. We provide empirical results on three multimodal datasets, including the newly introduced Localized Narratives.
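
The abstract does not specify which speech data augmentation replaces word masking, so the sketch below is only a hedged illustration: a SpecAugment-style time mask over log-mel features. The function name time_mask, the mask parameters, and the toy input are assumptions for illustration, not the paper's implementation. The idea it demonstrates is the one stated above: zeroing out random spans of audio frames removes acoustic evidence, which can push a multimodal model to attend to the visual stream.

import numpy as np

def time_mask(features, max_mask_len=20, num_masks=2, rng=None):
    """Zero out random spans of frames in a (num_frames, num_bins) array.

    Hypothetical SpecAugment-style augmentation: by masking stretches of
    the audio, the model is nudged to rely on the visual modality to
    recover the obscured content.
    """
    rng = rng or np.random.default_rng()
    out = features.copy()
    num_frames = out.shape[0]
    for _ in range(num_masks):
        # Draw a mask length in [0, max_mask_len]; skip degenerate cases.
        mask_len = int(rng.integers(0, max_mask_len + 1))
        if mask_len == 0 or mask_len >= num_frames:
            continue
        # Pick a start so the masked span stays inside the utterance.
        start = int(rng.integers(0, num_frames - mask_len + 1))
        out[start:start + mask_len, :] = 0.0
    return out

# Toy usage: a fake 100-frame, 80-bin log-mel spectrogram.
spectrogram = np.random.default_rng(0).standard_normal((100, 80)).astype(np.float32)
augmented = time_mask(spectrogram, rng=np.random.default_rng(1))

Unlike word masking, this operates directly on the acoustic features and needs no word-level alignments, which is consistent with the abstract's claim that the replacement is conceptually simpler.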