Paper Title
Conformer: Convolution-augmented Transformer for Speech Recognition
Paper Authors
Paper Abstract
Recently, Transformer- and convolutional neural network (CNN)-based models have shown promising results in Automatic Speech Recognition (ASR), outperforming recurrent neural networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively. In this work, we achieve the best of both worlds by studying how to combine convolutional neural networks and Transformers to model both the local and global dependencies of an audio sequence in a parameter-efficient way. To this end, we propose the convolution-augmented Transformer for speech recognition, named Conformer. Conformer significantly outperforms previous Transformer- and CNN-based models, achieving state-of-the-art accuracies. On the widely used LibriSpeech benchmark, our model achieves a WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/testother. We also observe competitive performance of 2.7%/6.3% with a small model of only 10M parameters.
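To illustrate the core idea of combining global self-attention with local convolution, here is a minimal, hypothetical NumPy sketch. It is not the paper's actual architecture (the real Conformer block also includes paired feed-forward modules, layer normalization, gating, and multi-head attention); it only shows how an attention step, which mixes all time frames, can be stacked with a depthwise convolution step, which mixes only a small neighborhood of frames, each with a residual connection.

```python
import numpy as np

def self_attention(x):
    # Global interaction: every frame attends to every other frame.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def depthwise_conv(x, kernel):
    # Local interaction: each channel mixes a small window of frames.
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([sum(kernel[j] * xp[i + j] for j in range(k))
                     for i in range(x.shape[0])])

def conformer_style_block(x, kernel):
    # Hypothetical simplification: attention, then convolution,
    # each wrapped in a residual connection.
    x = x + self_attention(x)
    x = x + depthwise_conv(x, kernel)
    return x

T, D = 6, 4  # toy sequence: 6 frames, 4 features per frame
x = np.random.default_rng(0).normal(size=(T, D))
y = conformer_style_block(x, kernel=np.array([0.25, 0.5, 0.25]))
print(y.shape)  # (6, 4): output keeps the input's time/feature shape
```

The design point the sketch captures is complementarity: the attention term alone cannot sharply localize, and the convolution term alone cannot see beyond its kernel, so interleaving the two lets one block model both dependency ranges.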