Paper Title
Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss
Paper Authors
Abstract
In this paper we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently. The activations from both audio and label encoders are combined with a feed-forward layer to compute a probability distribution over the label space for every combination of acoustic frame position and label history. This is similar to the Recurrent Neural Network Transducer (RNN-T) model, which uses RNNs for information encoding instead of Transformer encoders. The model is trained with the RNN-T loss, which is well suited to streaming decoding. We present results on the LibriSpeech dataset showing that limiting the left context for self-attention in the Transformer layers makes decoding computationally tractable for streaming, with only a slight degradation in accuracy. We also show that the full-attention version of our model beats the state-of-the-art accuracy on the LibriSpeech benchmarks. Our results also show that we can bridge the gap between the full-attention and limited-attention versions of our model by attending to a limited number of future frames.
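As a rough illustration of the joint computation the abstract describes (combining audio-encoder and label-encoder activations with a feed-forward layer to get a label distribution for every frame/label-history pair), the following is a minimal NumPy sketch. The dimensions, weight matrices, and `tanh` nonlinearity are placeholder assumptions, not details taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: T acoustic frames, U label positions,
# D encoder dimension, V output vocabulary size.
T, U, D, V = 4, 3, 8, 10
rng = np.random.default_rng(0)

f = rng.standard_normal((T, D))  # audio encoder activations, one per frame
g = rng.standard_normal((U, D))  # label encoder activations, one per label history

# Feed-forward joint layer: project each encoder output, combine every
# (frame, label-history) pair by broadcasting, then map to the label space.
W_f = rng.standard_normal((D, D))
W_g = rng.standard_normal((D, D))
W_out = rng.standard_normal((D, V))

h = np.tanh(f[:, None, :] @ W_f + g[None, :, :] @ W_g)  # shape (T, U, D)
probs = softmax(h @ W_out)  # shape (T, U, V): a distribution per (t, u) pair
```

During training, this (T, U, V) lattice of label probabilities is what the RNN-T loss marginalizes over; at decode time, only the frames seen so far are needed, which is what makes the transducer formulation streamable.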