Paper Title

End-to-end multi-talker audio-visual ASR using an active speaker attention module

Paper Authors

Richard Rose, Olivier Siohan

Paper Abstract

This paper presents a new approach for end-to-end audio-visual multi-talker speech recognition. The approach, referred to here as the visual context attention model (VCAM), is important because it uses the available video information to assign decoded text to one of multiple visible faces. This essentially resolves the label ambiguity issue associated with most multi-talker modeling approaches, which can decode multiple label strings but cannot assign the label strings to the correct speakers. This is implemented as a transformer-transducer-based end-to-end model and evaluated using a two-speaker audio-visual overlapping speech dataset created from YouTube videos. It is shown in the paper that the VCAM model improves performance with respect to previously reported audio-only and audio-visual multi-talker ASR systems.
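To make the core idea concrete, the sketch below shows one plausible way an attention module could tie acoustic frames to per-face visual embeddings, so that a decoded string can be attributed to a specific visible speaker. This is a minimal illustration, not the paper's actual implementation: the class name VisualContextAttention, the dimensions, and the residual fusion are assumptions.

```python
# Illustrative sketch (assumed design, not the authors' code) of a visual
# context attention layer: audio encoder frames attend over per-face visual
# embeddings, and the attention weights indicate which visible face each
# audio frame is associated with.
import torch
import torch.nn as nn


class VisualContextAttention(nn.Module):
    """Cross-attention from audio frames to per-speaker face embeddings."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio_frames, face_embeddings):
        # audio_frames:    (batch, T, d_model) acoustic encoder output
        # face_embeddings: (batch, S, d_model) one embedding per visible face
        context, weights = self.attn(
            query=audio_frames, key=face_embeddings, value=face_embeddings
        )
        # Residual fusion; `weights` has shape (batch, T, S), i.e. a
        # soft assignment of every audio frame to one of the S faces,
        # which is what allows text to be attributed to a speaker.
        return self.norm(audio_frames + context), weights


# Toy usage: 2 visible speakers (S=2), 100 audio frames.
layer = VisualContextAttention()
audio = torch.randn(1, 100, 256)
faces = torch.randn(1, 2, 256)
fused, attn_weights = layer(audio, faces)
print(fused.shape, attn_weights.shape)  # (1, 100, 256) (1, 100, 2)
```

In a transformer-transducer setup, the fused output would feed the downstream encoder layers; conditioning the decoding on a particular face embedding is one way the label ambiguity described in the abstract could be resolved.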
