Paper Title

Interactive Speech and Noise Modeling for Speech Enhancement

Paper Authors

Chengyu Zheng, Xiulian Peng, Yuan Zhang, Sriram Srinivasan, Yan Lu

Paper Abstract

Speech enhancement is challenging because of the diversity of background noise types. Most existing methods focus on modeling the speech rather than the noise. In this paper, we propose a novel idea to model speech and noise simultaneously in a two-branch convolutional neural network, namely SN-Net. In SN-Net, the two branches predict speech and noise, respectively. Instead of fusing information only at the final output layer, interaction modules are introduced at several intermediate feature domains between the two branches so that each benefits the other. Such an interaction can leverage features learned from one branch to counteract the undesired part and restore the missing component of the other, thus enhancing their discrimination capabilities. We also design a feature extraction module, namely residual-convolution-and-attention (RA), to capture the correlations along temporal and frequency dimensions for both the speech and the noise. Evaluations on public datasets show that the interaction module plays a key role in simultaneous modeling and that SN-Net outperforms the state-of-the-art by a large margin on various evaluation metrics. The proposed SN-Net also shows superior performance for speaker separation.
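The core interaction idea — using features from one branch to gate and residually complement the other — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the real interaction modules use learned convolutional layers inside a CNN, whereas here a single random matrix `w_mask` stands in for a learned 1x1 convolution, and the tensor shapes are purely illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def interaction(own_feat, other_feat, w_mask):
    """Gate the other branch's features and add them back residually.

    Features from both branches are concatenated along the channel
    axis and projected to a per-element mask in [0, 1]; the mask
    selects which parts of the other branch's features to inject.
    """
    merged = np.concatenate([own_feat, other_feat], axis=-1)  # (T, F, 2C)
    mask = sigmoid(merged @ w_mask)                           # (T, F, C)
    return own_feat + mask * other_feat

rng = np.random.default_rng(0)
T, F, C = 10, 257, 8  # frames, frequency bins, channels (illustrative only)
speech_feat = rng.standard_normal((T, F, C))
noise_feat = rng.standard_normal((T, F, C))
w_mask = 0.1 * rng.standard_normal((2 * C, C))  # stand-in for a learned 1x1 conv

# The interaction is applied symmetrically: each branch draws on the other.
speech_out = interaction(speech_feat, noise_feat, w_mask)
noise_out = interaction(noise_feat, speech_feat, w_mask)
print(speech_out.shape)  # (10, 257, 8)
```

The sketch captures why the interaction helps discrimination: the speech branch can subtract or recover components based on what the noise branch has learned, and vice versa, at intermediate feature levels rather than only at the final output.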
