Paper Title
Developing Real-time Streaming Transformer Transducer for Speech Recognition on Large-scale Dataset
Paper Authors
Paper Abstract
Recently, Transformer-based end-to-end models have achieved great success in many areas, including speech recognition. However, compared to LSTM models, the heavy computational cost of the Transformer during inference is a key issue that prevents its application. In this work, we explore the potential of Transformer Transducer (T-T) models for first-pass decoding with low latency and fast speed on a large-scale dataset. We combine the ideas of Transformer-XL and chunk-wise streaming processing to design a streamable Transformer Transducer model. We demonstrate that T-T outperforms the hybrid model, RNN Transducer (RNN-T), and streamable Transformer attention-based encoder-decoder model in the streaming scenario. Furthermore, the runtime cost and latency can be optimized with a relatively small look-ahead.
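To make the chunk-wise streaming idea concrete, below is a minimal illustrative sketch of how a chunk-wise self-attention mask with limited left context and within-chunk look-ahead could be built. This is an assumption-based example, not the authors' implementation; the function name, parameters (`chunk_size`, `left_chunks`), and the use of PyTorch are hypothetical choices for illustration only.

```python
import torch


def chunkwise_attention_mask(seq_len: int, chunk_size: int, left_chunks: int) -> torch.Tensor:
    """Build a boolean mask for chunk-wise streaming self-attention.

    A frame in chunk c may attend to all frames in its own chunk (a small
    within-chunk look-ahead) and to `left_chunks` preceding chunks, which
    approximates reusing a cached history in the spirit of Transformer-XL.
    True marks key positions that a query position is allowed to attend to.
    """
    chunk_idx = torch.arange(seq_len) // chunk_size  # chunk id of each frame
    q = chunk_idx.unsqueeze(1)  # query-side chunk ids, shape (seq_len, 1)
    k = chunk_idx.unsqueeze(0)  # key-side chunk ids, shape (1, seq_len)
    # Allowed: keys no later than the query's chunk and no earlier than
    # `left_chunks` chunks back.
    return (k <= q) & (k >= q - left_chunks)


# Example: 8 frames, chunks of 2, one chunk of left context.
# Frame 4 (chunk 2) may attend to frames 2-5, i.e. the previous chunk plus
# its own chunk, including one frame of look-ahead (frame 5).
mask = chunkwise_attention_mask(seq_len=8, chunk_size=2, left_chunks=1)
print(mask.int())
```

Restricting attention this way keeps per-frame computation bounded during streaming inference, while the size of the look-ahead (here, the within-chunk future frames) trades off latency against accuracy, as the abstract notes.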