Paper Title

Speech Separation Based on Multi-Stage Elaborated Dual-Path Deep BiLSTM with Auxiliary Identity Loss

Paper Authors

Ziqiang Shi, Rujie Liu, Jiqing Han

Paper Abstract

Deep neural networks with dual-path bi-directional long short-term memory (BiLSTM) blocks have proved to be very effective in sequence modeling, especially in speech separation. This work investigates how to extend the dual-path BiLSTM into a new state-of-the-art approach, called TasTas, for multi-talker monaural speech separation (a.k.a. the cocktail party problem). TasTas introduces two simple but effective improvements to boost the performance of dual-path BiLSTM based networks: one is an iterative multi-stage refinement scheme, and the other corrects imperfectly separated speech through a loss that enforces speaker identity consistency between the separated speech and the original speech. TasTas takes the mixed utterance of two speakers and maps it to two separated utterances, where each utterance contains only one speaker's voice. Our experiments on the notable benchmark WSJ0-2mix data corpus achieve a 20.55 dB SDR improvement, a 20.35 dB SI-SDR improvement, a PESQ of 3.69, and an ESTOI of 94.86%, which shows that our proposed networks can lead to a large performance improvement on the speaker separation task. We have open sourced our re-implementation of DPRNN-TasNet (https://github.com/ShiZiqiang/dual-path-RNNs-DPRNNs-based-speech-separation), and our TasTas is built on this implementation of DPRNN-TasNet, so we believe the results in this paper can be reproduced with ease.
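The abstract does not spell out how the auxiliary identity loss is combined with the separation objective. The sketch below is a minimal, hypothetical PyTorch formulation under the assumption that a pretrained speaker embedding network (`speaker_encoder`) and a weighting factor (`alpha`) are available; it only illustrates the general idea of penalizing speaker-identity mismatch alongside the usual SI-SDR objective, and is not the authors' implementation (which is built on the DPRNN-TasNet code linked above).

```python
import torch
import torch.nn.functional as F

def si_sdr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SDR, the usual separation objective.
    est, ref: (batch, samples) waveforms."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to obtain the target component.
    s_target = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = est - s_target
    si_sdr = 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)
    return -si_sdr.mean()

def identity_consistency_loss(est, ref, speaker_encoder):
    """Auxiliary term: the separated utterance should map to the same
    speaker embedding as the corresponding clean source.
    speaker_encoder is an assumed pretrained waveform-to-embedding network."""
    emb_est = speaker_encoder(est)   # (batch, embed_dim)
    emb_ref = speaker_encoder(ref)   # (batch, embed_dim)
    return (1.0 - F.cosine_similarity(emb_est, emb_ref, dim=-1)).mean()

def total_loss(est, ref, speaker_encoder, alpha=0.1):
    """SI-SDR objective plus a weighted identity-consistency penalty
    (alpha is a hypothetical weight, not taken from the paper)."""
    return si_sdr_loss(est, ref) + alpha * identity_consistency_loss(est, ref, speaker_encoder)
```

In practice such a combined loss would be evaluated per separated source under permutation-invariant training; the exact encoder, weighting, and training schedule used by TasTas are described in the paper itself.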
