Paper Title
Look&Listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement
Paper Authors
Paper Abstract
Active speaker detection and speech enhancement have become two increasingly attractive topics in audio-visual scenario understanding. Owing to their distinct characteristics, independently designed architectures have been widely adopted, one per task. This can make the representations learned by each model task-specific and inevitably limits the generalization ability of features built on multi-modal modeling. More recent studies have shown that establishing cross-modal relationships between the auditory and visual streams is a promising solution to the challenge of audio-visual multi-task learning. Motivated by the goal of bridging multi-modal associations across audio-visual tasks, this study proposes a unified framework that achieves target speaker detection and speech enhancement through joint learning of an audio-visual model.
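To make the joint-learning idea concrete, the following is a minimal PyTorch-style sketch of a unified audio-visual multi-task model: a shared cross-modal fusion feeds two heads, one for active speaker detection (frame-level classification) and one for speech enhancement (spectral mask estimation). All module names, feature sizes, and the attention-based fusion are illustrative assumptions, not the paper's actual architecture.

```python
# Sketch of a unified audio-visual multi-task model (assumed design,
# not the paper's architecture): shared cross-modal fusion with two
# task heads trained jointly.
import torch
import torch.nn as nn

class UnifiedAVModel(nn.Module):
    def __init__(self, audio_dim=256, visual_dim=256, hidden=256):
        super().__init__()
        # Per-modality temporal encoders (stand-ins for real backbones).
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        self.visual_enc = nn.GRU(visual_dim, hidden, batch_first=True)
        # Cross-modal correlation: visual frames attend over audio frames.
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=4,
                                                batch_first=True)
        # Task heads share the fused audio-visual representation.
        self.asd_head = nn.Linear(2 * hidden, 1)          # speaking / silent per frame
        self.enh_head = nn.Linear(2 * hidden, audio_dim)  # per-frame spectral mask

    def forward(self, audio_feats, visual_feats):
        a, _ = self.audio_enc(audio_feats)    # (B, T, H)
        v, _ = self.visual_enc(visual_feats)  # (B, T, H)
        av, _ = self.cross_attn(v, a, a)      # visual queries, audio keys/values
        fused = torch.cat([av, a], dim=-1)    # (B, T, 2H) shared representation
        asd_logits = self.asd_head(fused).squeeze(-1)   # (B, T)
        enh_mask = torch.sigmoid(self.enh_head(fused))  # (B, T, audio_dim)
        return asd_logits, enh_mask

# Joint training couples both task losses on the shared representation.
model = UnifiedAVModel()
audio = torch.randn(2, 50, 256)   # e.g. 50 frames of log-mel features
visual = torch.randn(2, 50, 256)  # e.g. 50 frames of face-crop embeddings
asd_logits, enh_mask = model(audio, visual)
asd_loss = nn.functional.binary_cross_entropy_with_logits(
    asd_logits, torch.randint(0, 2, (2, 50)).float())  # dummy speaking labels
enh_loss = nn.functional.mse_loss(enh_mask * audio, audio)  # placeholder target
(asd_loss + enh_loss).backward()
```

Because both heads backpropagate through the same cross-attention fusion, the learned audio-visual representation is shaped by both tasks at once, which is the generalization benefit the abstract argues independently designed single-task architectures lack.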