Paper Title

ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context

Paper Authors

Wei Han, Zhengdong Zhang, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, Yonghui Wu

Paper Abstract

Convolutional neural networks (CNNs) have shown promising results for end-to-end speech recognition, albeit still behind other state-of-the-art methods in performance. In this paper, we study how to bridge this gap and go beyond it with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into the convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple scaling method that scales the width of ContextNet, achieving a good trade-off between computation and accuracy. We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without an external language model (LM), 1.9%/4.1% with an LM, and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets. This compares to the previous best published system, which achieves 2.0%/4.6% with an LM and 3.9%/11.3% with 20M parameters. The superiority of the proposed ContextNet model is also verified on a much larger internal dataset.
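To make the abstract's key mechanism concrete, here is a minimal numpy sketch of a squeeze-and-excitation gate applied to a (time, channels) feature map: the "squeeze" averages over the entire time axis to give each channel a global context summary, and the "excitation" maps that summary through a small bottleneck and a sigmoid to produce per-channel gates that rescale every frame. The weight shapes, bottleneck size, and function name are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def squeeze_and_excitation(x, w1, b1, w2, b2):
    """Gate a (time, channels) feature map with global context.

    Illustrative sketch only; shapes and bottleneck size are assumptions,
    not ContextNet's actual configuration.
    """
    s = x.mean(axis=0)                        # squeeze: global average over time, shape (C,)
    h = np.maximum(s @ w1 + b1, 0.0)          # excitation bottleneck with ReLU
    g = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))  # sigmoid gates in (0, 1), shape (C,)
    return x * g                              # broadcast the gates over all time steps

# Toy example: 5 frames, 4 channels, bottleneck of size 2.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))
w1, b1 = rng.standard_normal((4, 2)), np.zeros(2)
w2, b2 = rng.standard_normal((2, 4)), np.zeros(4)
y = squeeze_and_excitation(x, w1, b1, w2, b2)
```

Because the gates depend on an average over the whole utterance, every frame's output is modulated by global context, which is what lets a fully convolutional encoder see beyond its local receptive field.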
