Paper Title


Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention

Paper Authors

Xinmeng Xu, Yang Wang, Jie Jia, Binbin Chen, Dejun Li

Paper Abstract


Audio-visual speech enhancement systems are regarded as one of the promising solutions for isolating and enhancing the speech of a desired speaker. Typical methods focus on predicting the clean speech spectrum via a naive convolutional neural network (CNN) based encoder-decoder architecture, and these methods a) do not use the data adequately and b) are unable to effectively balance audio-visual features. The proposed model alleviates these drawbacks by a) applying a model that fuses audio and visual features layer by layer in the encoding phase and feeds the fused audio-visual features to each corresponding decoder layer, and, more importantly, b) introducing a two-stage multi-head cross attention (MHCA) mechanism to infer audio-visual speech enhancement, balancing the fused audio-visual features and eliminating irrelevant ones. This paper proposes an attentional audio-visual multi-layer feature fusion model in which MHCA units are applied to the feature maps at every decoder layer. The proposed model demonstrates superior performance against state-of-the-art models.
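To make the two-stage MHCA idea concrete, below is a minimal PyTorch sketch of cross-attention-based audio-visual fusion. This is an illustrative assumption, not the authors' published implementation: the feature dimension (256), the head count (4), the residual/LayerNorm arrangement, and the exact roles of the two stages (stage 1 cross-attends audio queries to visual keys/values; stage 2 self-attends over the fused result to suppress irrelevant features) are all hypothetical choices built on PyTorch's standard `nn.MultiheadAttention`.

```python
import torch
import torch.nn as nn

class MHCAFusion(nn.Module):
    """Hypothetical two-stage multi-head cross attention (MHCA) fusion block.

    A sketch of the kind of unit the abstract describes; dimensions and the
    two-stage arrangement are assumptions, not the paper's implementation.
    """

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Stage 1: audio queries attend to visual keys/values.
        self.stage1 = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Stage 2: self-attention over the fused features, giving the model
        # a chance to down-weight irrelevant (e.g. uninformative) cues.
        self.stage2 = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio:  (batch, T_audio, dim) audio feature map from one layer
        # visual: (batch, T_video, dim) lip-region visual feature map
        av, _ = self.stage1(query=audio, key=visual, value=visual)
        av = self.norm1(audio + av)          # residual audio-visual fusion
        out, _ = self.stage2(query=av, key=av, value=av)
        return self.norm2(av + out)          # (batch, T_audio, dim)

# Usage: fuse 100 audio frames with 25 video frames per clip.
fusion = MHCAFusion(dim=256, num_heads=4)
fused = fusion(torch.randn(8, 100, 256), torch.randn(8, 25, 256))
print(fused.shape)  # torch.Size([8, 100, 256])
```

In a layer-by-layer scheme like the one the abstract outlines, one such block would sit at each decoder layer, consuming that layer's fused encoder features; the output length follows the audio stream, so it can be concatenated or added to the decoder's feature map directly.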
