Paper Title

Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition

Authors

Joanna Hong, Minsu Kim, Daehun Yoo, Yong Man Ro

Abstract

This paper focuses on designing a noise-robust end-to-end Audio-Visual Speech Recognition (AVSR) system. To this end, we propose the Visual Context-driven Audio Feature Enhancement module (V-CAFE), which enhances noisy input audio with the help of audio-visual correspondence. The proposed V-CAFE is designed to capture the transition of lip movements, namely the visual context, and to generate a noise reduction mask conditioned on the obtained visual context. Through this context-dependent modeling, the ambiguity in viseme-to-phoneme mapping can be alleviated when generating the mask. The noisy representations are masked with the noise reduction mask, producing enhanced audio features. The enhanced audio features are fused with the visual features and fed into an encoder-decoder model composed of a Conformer encoder and a Transformer decoder for speech recognition. We show that the proposed end-to-end AVSR with V-CAFE further improves the noise robustness of AVSR. The effectiveness of the proposed method is evaluated in noisy speech recognition and overlapped speech recognition experiments on the two largest audio-visual datasets, LRS2 and LRS3.
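To make the described pipeline concrete, below is a minimal PyTorch sketch of a V-CAFE-style enhancement step based only on this abstract: a temporal model over lip features supplies the visual context, cross-attention conditions a sigmoid noise reduction mask on that context, and the masked audio features are fused with the visual features. The module name, the feature dimensions, and the choice of a GRU plus cross-attention are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of visual-context-driven audio enhancement (not the
# paper's code): module structure and hyperparameters are assumptions.
import torch
import torch.nn as nn

class VisualContextAudioEnhancer(nn.Module):
    """Generates a noise reduction mask from lip-movement context and applies
    it to noisy audio features, then fuses the two modality streams."""
    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        # Temporal model over lip features, capturing the transition of lip
        # movements (the "visual context" described in the abstract).
        self.visual_context = nn.GRU(dim, dim, batch_first=True)
        # Cross-attention: audio queries attend to the visual context
        # (an assumed conditioning mechanism).
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Per-element mask in [0, 1] used to suppress noisy components.
        self.mask_head = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        # Simple concat-then-project fusion of enhanced audio and visual features.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio, visual: (batch, time, dim), assumed to be time-aligned.
        vis_ctx, _ = self.visual_context(visual)
        attended, _ = self.cross_attn(query=audio, key=vis_ctx, value=vis_ctx)
        mask = self.mask_head(attended)    # noise reduction mask
        enhanced = audio * mask            # mask the noisy representations
        return self.fuse(torch.cat([enhanced, visual], dim=-1))

# Usage: fused features of shape (batch, time, dim) come out, ready for
# a downstream recognizer.
module = VisualContextAudioEnhancer()
fused = module(torch.randn(2, 50, 256), torch.randn(2, 50, 256))
print(fused.shape)  # torch.Size([2, 50, 256])
```

In the full system described by the abstract, these fused features would then be taken to the Conformer encoder and Transformer decoder for speech recognition; that recognizer is omitted from the sketch.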
