Paper Title
Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer
Paper Authors
Paper Abstract
Recurrent Neural Network Transducer (RNN-T), like most end-to-end speech recognition model architectures, has an implicit neural network language model (NNLM) and cannot easily leverage unpaired text data during training. Previous work has proposed various fusion methods that incorporate external NNLMs into end-to-end ASR to address this weakness. In this paper, we propose extensions to these techniques that allow RNN-T to exploit external NNLMs at both training and inference time, yielding 13-18% relative word error rate improvement on LibriSpeech compared to strong baselines. Furthermore, our methods incur no extra algorithmic latency and allow different NNLMs to be plugged in without re-training. We also share an in-depth analysis to better understand the benefits of the different NNLM fusion methods. Our work provides a reliable technique for leveraging unpaired text data to significantly improve RNN-T while keeping the system streamable, flexible, and lightweight.
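For concreteness, below is a minimal sketch of inference-time shallow fusion, the simplest of the NNLM fusion methods this line of work builds on: at each beam-search step, the RNN-T token log-probabilities are interpolated with those of the external NNLM. The function name `fuse_log_probs` and the weight value are illustrative assumptions, not the paper's implementation.

```python
import torch

def fuse_log_probs(rnnt_log_probs: torch.Tensor,
                   lm_log_probs: torch.Tensor,
                   lm_weight: float = 0.3) -> torch.Tensor:
    """Shallow fusion: combine RNN-T and external NNLM scores per token.

    rnnt_log_probs: (vocab_size,) log-softmax output of the RNN-T joiner
                    at the current beam-search step
    lm_log_probs:   (vocab_size,) log-softmax output of the external NNLM
                    given the same label history
    lm_weight:      interpolation weight for the external LM, typically
                    tuned on a development set
    """
    # In practice the blank symbol is usually excluded from LM scoring;
    # this sketch fuses all tokens for simplicity.
    return rnnt_log_probs + lm_weight * lm_log_probs
```

Because the fused score is computed per decoding step, this style of fusion adds no algorithmic latency, and the external NNLM can be swapped without retraining the RNN-T, consistent with the plug-and-play property the abstract describes.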