On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments

Paper Title

On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments

Authors

Jisi Zhang, Catalin Zorila, Rama Doddipatla, Jon Barker

Abstract

This paper introduces a new method for multi-channel time domain speech separation in reverberant environments. A fully-convolutional neural network structure is used to separate speech directly from multiple microphone recordings, without the need for conventional spatial feature extraction. To reduce the influence of reverberation on spatial feature extraction, a dereverberation pre-processing method is applied to further improve the separation performance. A spatialized version of the wsj0-2mix dataset has been simulated to evaluate the proposed system. Both the source separation performance and the speech recognition performance of the separated signals have been evaluated objectively. Experiments show that the proposed fully-convolutional network improves the source separation metric and the word error rate (WER) by more than 13% and 50% relative, respectively, over a reference system with conventional features. Applying dereverberation as pre-processing to the proposed system further reduces the WER by 29% relative when using an acoustic model trained on clean and reverberated data.
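The abstract reports its gains as relative figures, so a short sketch of that arithmetic may help. The helper name `relative_reduction` and the WER values below are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction of `improved` w.r.t. `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


# Hypothetical WER values (in percent), not taken from the paper.
baseline_wer = 40.0
system_wer = 20.0
print(relative_reduction(baseline_wer, system_wer))  # 50.0
```

Under this reading, a 50% relative WER reduction means the error rate is halved with respect to the reference system, whatever its absolute value.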

