Paper Title
Improving RNN-T ASR Accuracy Using Context Audio
Paper Authors
Paper Abstract
We present a training scheme for streaming automatic speech recognition (ASR) based on recurrent neural network transducers (RNN-T) which allows the encoder network to learn to exploit context audio from a stream, using segmented or partially labeled sequences of the stream during training. We show that the use of context audio during training and inference can lead to word error rate reductions of more than 6% in a realistic production setting for a voice assistant ASR system. We investigate the effect of the proposed training approach on acoustically challenging data containing background speech and present data points which indicate that this approach helps the network learn both speaker and environment adaptation. To gain further insight into the ability of a long short-term memory (LSTM) based ASR encoder to exploit long-term context, we also visualize RNN-T loss gradients with respect to the input.
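The abstract refers to the RNN-T (transducer) loss, whose gradients with respect to the input the authors visualize. As background, the loss itself is the negative log-probability of the target sequence, summed over all blank/label alignments via a forward-variable recursion over the (time, label) lattice. The sketch below is a toy NumPy implementation of that recursion for a single utterance; the function name, shapes, and blank index are illustrative choices, not details taken from the paper.

```python
import numpy as np

def rnnt_loss_numpy(log_probs, target, blank=0):
    """Toy RNN-T (transducer) loss via the forward-variable recursion.

    log_probs: shape (T, U+1, K) -- per-lattice-node output log
    probabilities, where T is the number of acoustic frames, U the
    target length, and K the vocabulary size (including blank).
    target: length-U sequence of label indices.
    Returns the negative log-probability of `target` summed over all
    blank/label alignments.
    """
    T, U_plus_1, _ = log_probs.shape
    U = U_plus_1 - 1
    assert len(target) == U

    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0  # start in the lattice corner (t=0, u=0)
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            cands = []
            if t > 0:  # arrive by emitting blank at frame t-1
                cands.append(alpha[t - 1, u] + log_probs[t - 1, u, blank])
            if u > 0:  # arrive by emitting label target[u-1] at frame t
                cands.append(alpha[t, u - 1]
                             + log_probs[t, u - 1, target[u - 1]])
            alpha[t, u] = np.logaddexp.reduce(cands)

    # Terminate with a final blank emission from node (T-1, U).
    return -(alpha[T - 1, U] + log_probs[T - 1, U, blank])
```

With uniform log-probabilities over K = 3 symbols, T = 2 frames and a one-label target, exactly two alignments (each of probability (1/3)^3) reach the final node, so the loss is -log(2/27) = 3 log 3 - log 2. In the paper's setting one would instead backpropagate this loss through the encoder to the input features, which is what makes the gradient visualization in the abstract possible.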