Paper Title


End-to-End Multi-speaker Speech Recognition with Transformer

Paper Authors

Chang, Xuankai; Zhang, Wangyou; Qian, Yanmin; Le Roux, Jonathan; Watanabe, Shinji

Abstract


Recently, fully recurrent neural network (RNN) based end-to-end models have been proven to be effective for multi-speaker speech recognition in both the single-channel and multi-channel scenarios. In this work, we explore the use of Transformer models for these tasks by focusing on two aspects. First, we replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture. Second, in order to use the Transformer in the masking network of the neural beamformer in the multi-channel case, we modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation. Besides the model architecture improvements, we also incorporate an external dereverberation preprocessing, the weighted prediction error (WPE), enabling our model to handle reverberated signals. Experiments on the spatialized wsj1-2mix corpus show that the Transformer-based models achieve 40.9% and 25.6% relative WER reduction, down to 12.1% and 6.4% WER, under the anechoic condition in single-channel and multi-channel tasks, respectively, while in the reverberant case, our methods achieve 41.5% and 13.8% relative WER reduction, down to 16.5% and 15.2% WER.
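The key architectural change in the masking network is restricting self-attention to a fixed-size segment rather than the whole sequence, which reduces the attention cost from quadratic in the sequence length to quadratic in the segment length. The sketch below illustrates the idea with a toy single-head self-attention in NumPy; the function names, the fixed non-overlapping segmentation, and the absence of learned projections are simplifying assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def segment_attention_mask(seq_len: int, segment_size: int) -> np.ndarray:
    """Boolean mask: frame i may attend to frame j only if both fall in
    the same fixed-size segment (hypothetical non-overlapping segmentation)."""
    seg = np.arange(seq_len) // segment_size
    return seg[:, None] == seg[None, :]

def segment_self_attention(x: np.ndarray, segment_size: int) -> np.ndarray:
    """Toy single-head self-attention (no learned Q/K/V projections)
    with the segment restriction applied to the attention logits."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)              # (T, T) attention logits
    mask = segment_attention_mask(x.shape[0], segment_size)
    scores = np.where(mask, scores, -np.inf)   # forbid cross-segment attention
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)         # softmax within each segment
    return w @ x
```

With segments of length S, each of the T frames attends to at most S others, so the attention computation scales as O(T·S) instead of O(T²), which is what makes the Transformer masking network tractable on long multi-channel utterances.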
