Paper Title
DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement
Paper Authors
Paper Abstract
Speech enhancement has benefited from the success of deep learning in terms of intelligibility and perceptual quality. Conventional time-frequency (TF) domain methods focus on predicting TF masks or the speech spectrum via a naive convolutional neural network (CNN) or recurrent neural network (RNN). Some recent studies use the complex-valued spectrogram as a training target but train in a real-valued network, predicting the magnitude and phase components or the real and imaginary parts, respectively. In particular, the convolutional recurrent network (CRN) integrates a convolutional encoder-decoder (CED) structure and long short-term memory (LSTM), which has been proven helpful for complex targets. To train with the complex target more effectively, in this paper we design a new network structure that simulates complex-valued operations, called the Deep Complex Convolution Recurrent Network (DCCRN), in which both the CNN and RNN structures handle complex-valued operations. The proposed DCCRN models are very competitive with previous networks on both objective and subjective metrics. With only 3.7M parameters, our DCCRN models submitted to the Interspeech 2020 Deep Noise Suppression (DNS) challenge ranked first on the real-time track and second on the non-real-time track in terms of Mean Opinion Score (MOS).
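The abstract's key idea is a CNN (and RNN) that simulates complex-valued operations by pairing real-valued sub-modules according to the complex multiplication rule (a+ib)(c+id) = (ac-bd) + i(ad+bc). The sketch below illustrates that idea for a single convolutional layer in PyTorch; it is a minimal illustration of the general technique, not the authors' released implementation, and the class name `ComplexConv2d`, the layer sizes, and the input shape are assumptions made for the example.

```python
import torch
import torch.nn as nn


class ComplexConv2d(nn.Module):
    """Complex 2-D convolution built from two real-valued convolutions.

    For complex input x = x_r + i*x_i and complex kernel W = W_r + i*W_i,
    the output is (W_r*x_r - W_i*x_i) + i(W_r*x_i + W_i*x_r).
    """

    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        super().__init__()
        # One real-valued convolution for the real kernel, one for the imaginary kernel.
        self.conv_r = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)
        self.conv_i = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)

    def forward(self, x_r, x_i):
        # Real part: W_r * x_r - W_i * x_i
        out_r = self.conv_r(x_r) - self.conv_i(x_i)
        # Imaginary part: W_r * x_i + W_i * x_r
        out_i = self.conv_r(x_i) + self.conv_i(x_r)
        return out_r, out_i


if __name__ == "__main__":
    # Toy complex spectrogram batch: (batch, channels, freq_bins, frames)
    x_r = torch.randn(1, 1, 256, 100)
    x_i = torch.randn(1, 1, 256, 100)
    layer = ComplexConv2d(1, 16, kernel_size=(5, 2), stride=(2, 1), padding=(2, 0))
    y_r, y_i = layer(x_r, x_i)
    print(y_r.shape, y_i.shape)  # torch.Size([1, 16, 128, 99]) for both parts
```

Per the abstract, the same pairing of real-valued sub-modules is applied to the recurrent part as well, so the whole encoder-LSTM-decoder pipeline operates on real and imaginary components jointly rather than on the magnitude spectrum alone.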