A Better and Faster End-to-End Model for Streaming ASR

Paper title

A Better and Faster End-to-End Model for Streaming ASR

Authors

Li, Bo; Gulati, Anmol; Yu, Jiahui; Sainath, Tara N.; Chiu, Chung-Cheng; Narayanan, Arun; Chang, Shuo-Yiin; Pang, Ruoming; He, Yanzhang; Qin, James; Han, Wei; Liang, Qiao; Zhang, Yu; Strohman, Trevor; Wu, Yonghui

Abstract

End-to-end (E2E) models have been shown to outperform state-of-the-art conventional models for streaming speech recognition [1] across many dimensions, including quality (as measured by word error rate (WER)) and endpointer latency [2]. However, E2E models still tend to delay their predictions towards the end and thus have much higher partial latency than a conventional ASR model. To address this issue, we look at encouraging the E2E model to emit words early, through an algorithm called FastEmit [3]. Naturally, improving latency results in a quality degradation. To recover quality, we first explore replacing the LSTM layers in the encoder of our E2E model with Conformer layers [4], which have shown good improvements for ASR. Second, we explore running a 2nd-pass beam search to improve quality. To ensure that the 2nd pass completes quickly, we explore non-causal Conformer layers that feed into the same 1st-pass RNN-T decoder, an algorithm called Cascaded Encoders [5]. Overall, we find that the Conformer RNN-T with Cascaded Encoders offers a better quality and latency tradeoff for streaming ASR.
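The abstract measures quality by word error rate (WER). As a reminder of what that metric computes, here is a minimal sketch that assumes whitespace tokenization and the standard Levenshtein definition (substitutions + deletions + insertions, divided by the number of reference words). The function name `word_error_rate` is chosen for illustration only; the paper's own scoring pipeline and any text normalization it applies are not described in this listing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the").
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 1.0 when the hypothesis is much longer than the reference.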

Paper title