Paper Title

An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

Authors

Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Jesper Jensen

Abstract

Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets and objective functions. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation. Finally, we survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance.
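The pipeline the abstract outlines (acoustic features, visual features, a fusion step, and a mask-style training target) can be sketched in a few lines. This is a minimal illustration, not a system from the survey: all shapes, the random "features", and the single linear layer standing in for a deep network are hypothetical, and frame-wise concatenation is just one of the fusion strategies the paper discusses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: T audio frames, F frequency bins, D-dim visual embedding.
T, F, D = 100, 257, 64

# Stand-ins for real features: a log-magnitude spectrogram of the noisy mixture,
# and lip-region embeddings upsampled to the audio frame rate.
acoustic = rng.standard_normal((T, F))
visual = rng.standard_normal((T, D))

# Early fusion by frame-wise concatenation of the two modalities.
fused = np.concatenate([acoustic, visual], axis=1)   # shape (T, F + D)

# An untrained linear layer in place of the deep network, with a sigmoid so the
# output lies in [0, 1] and can serve as a time-frequency mask (a common training target).
W = rng.standard_normal((F + D, F)) * 0.01
mask = 1.0 / (1.0 + np.exp(-(fused @ W)))            # shape (T, F)

# Enhancement: apply the predicted mask to the noisy acoustic representation.
enhanced = mask * acoustic
print(fused.shape, mask.shape, enhanced.shape)
```

In a trained system the mask would be learned by minimizing an objective function between `enhanced` and the clean target spectrogram; the survey's taxonomy of training targets and objective functions covers the common choices.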
