Low Latency End-to-End Streaming Speech Recognition with a Scout Network

Paper title

Low Latency End-to-End Streaming Speech Recognition with a Scout Network

Paper authors

Chengyi Wang, Yu Wu, Shujie Liu, Jinyu Li, Liang Lu, Guoli Ye, Ming Zhou

Paper abstract

The attention-based Transformer model has achieved promising results for speech recognition (SR) in the offline mode. However, in the streaming mode, the Transformer model usually incurs significant latency to maintain its recognition accuracy when applying a fixed-length look-ahead window in each encoder layer. In this paper, we propose a novel low-latency streaming approach for Transformer models, which consists of a scout network and a recognition network. The scout network detects the whole word boundary without seeing any future frames, while the recognition network predicts the next subword by utilizing the information from all the frames before the predicted boundary. Our model achieves the best performance (2.7/6.4 WER) with only 639 ms latency on the test-clean and test-other data sets of Librispeech.

Paper title