Paper Title

Improving non-autoregressive end-to-end speech recognition with pre-trained acoustic and language models

Authors

Keqi Deng, Zehui Yang, Shinji Watanabe, Yosuke Higuchi, Gaofeng Cheng, Pengyuan Zhang

Abstract

While Transformers have achieved promising results in end-to-end (E2E) automatic speech recognition (ASR), their autoregressive (AR) structure becomes a bottleneck that slows down the decoding process. For real-world deployment, ASR systems need to be highly accurate while achieving fast inference. Non-autoregressive (NAR) models have become a popular alternative due to their fast inference speed, but they still lag behind AR systems in recognition accuracy. To fulfill both demands, in this paper we propose a NAR CTC/attention model that utilizes pre-trained acoustic and language models: wav2vec2.0 and BERT. To bridge the modality gap between the speech and text representations obtained from the pre-trained models, we design a novel modality conversion mechanism that is more suitable for logographic languages. During inference, we employ a CTC branch to generate a target length, which enables BERT to predict tokens in parallel. We also design a cache-based CTC/attention joint decoding method that improves recognition accuracy while keeping decoding fast. Experimental results show that the proposed NAR model greatly outperforms our strong wav2vec2.0 CTC baseline (15.1% relative CER reduction on AISHELL-1). It also significantly surpasses previous NAR systems on the AISHELL-1 benchmark and shows potential for English tasks.
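The inference procedure the abstract describes (a CTC branch first predicts the output length, then a BERT-style decoder fills in every token position in one parallel pass) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the toy encoder, the average pooling used in place of the paper's modality conversion mechanism, and all module names, vocabulary sizes, and shapes are assumptions made for illustration only.

```python
# Minimal sketch of NAR inference: CTC predicts the target length, then a
# parallel decoder scores all token positions at once. NOT the paper's code;
# every module here is a hypothetical stand-in.
import torch
import torch.nn as nn

BLANK_ID = 0      # CTC blank symbol (assumption)
VOCAB_SIZE = 100  # toy vocabulary (assumption)
HIDDEN = 64       # toy hidden size (assumption)

class ToyEncoder(nn.Module):
    """Stand-in for the wav2vec2.0 acoustic encoder (hypothetical shapes)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(40, HIDDEN)  # 40-dim input features, an assumption

    def forward(self, feats):
        return self.proj(feats)  # (batch, frames, HIDDEN)

def ctc_greedy_collapse(ctc_logits):
    """Greedy CTC decoding: per-frame argmax, merge repeats, drop blanks.
    The length of the collapsed sequence is the predicted target length."""
    path = ctc_logits.argmax(dim=-1)  # (frames,)
    tokens, prev = [], BLANK_ID
    for t in path.tolist():
        if t != BLANK_ID and t != prev:
            tokens.append(t)
        prev = t
    return tokens

encoder = ToyEncoder()
ctc_head = nn.Linear(HIDDEN, VOCAB_SIZE)     # CTC branch over encoder states
nar_decoder = nn.Linear(HIDDEN, VOCAB_SIZE)  # stand-in for a BERT-style decoder

feats = torch.randn(1, 120, 40)              # one utterance, 120 frames
enc = encoder(feats)                         # acoustic representations
ctc_logits = ctc_head(enc)[0]                # (frames, VOCAB_SIZE)

# Step 1: the CTC branch yields a hypothesis whose length fixes the output length.
target_len = max(len(ctc_greedy_collapse(ctc_logits)), 1)

# Step 2: map frame-level states to `target_len` token slots. Plain average
# pooling is used here purely for illustration; the paper's modality conversion
# mechanism performs this speech-to-text mapping properly.
slots = nn.functional.adaptive_avg_pool1d(
    enc.transpose(1, 2), target_len).transpose(1, 2)

# Step 3: all token positions are scored in a single parallel forward pass,
# with no left-to-right autoregressive loop.
nar_logits = nar_decoder(slots)              # (1, target_len, VOCAB_SIZE)
print(target_len, nar_logits.argmax(dim=-1).shape)
```

The sketch covers only the length-prediction and parallel-prediction steps; the actual system additionally applies the cache-based CTC/attention joint decoding mentioned above to refine the hypotheses.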
