Paper Title
ConvRNN-T: Convolutional Augmented Recurrent Neural Network Transducers for Streaming Speech Recognition
Paper Authors
Paper Abstract
The recurrent neural network transducer (RNN-T) is a prominent streaming end-to-end (E2E) ASR technology. In RNN-T, the acoustic encoder commonly consists of stacks of LSTMs. Very recently, as an alternative to LSTM layers, the Conformer architecture was introduced, where the encoder of RNN-T is replaced with a modified Transformer encoder composed of convolutional layers at the frontend and between attention layers. In this paper, we introduce a new streaming ASR model, Convolutional Augmented Recurrent Neural Network Transducers (ConvRNN-T), in which we augment the LSTM-based RNN-T with a novel convolutional frontend consisting of local and global context CNN encoders. ConvRNN-T takes advantage of causal 1-D convolutional layers, squeeze-and-excitation, dilation, and residual blocks to provide both global and local audio context representation to the LSTM layers. We show ConvRNN-T outperforms RNN-T, Conformer, and ContextNet on LibriSpeech and in-house data. In addition, ConvRNN-T offers less computational complexity compared to Conformer. ConvRNN-T's superior accuracy, along with its low footprint, makes it a promising candidate for on-device streaming ASR technologies.
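The causal, dilated 1-D convolutions named in the abstract are what keep the frontend streamable: each output frame may depend only on current and past input frames. The following is a minimal, hypothetical sketch of that building block in pure Python (the paper's actual model would use a deep-learning framework and learned kernels); the function name and left-padding scheme are our illustrative assumptions, not the authors' code.

```python
def causal_dilated_conv1d(x, kernel, dilation=1):
    """Causal dilated 1-D convolution over a sequence x.

    The output at time t depends only on inputs at times <= t
    (achieved by left-padding with zeros), so the layer can run
    frame-by-frame in a streaming recognizer. Taps are spaced
    `dilation` steps apart, widening the receptive field without
    adding parameters.
    """
    k = len(kernel)
    pad = (k - 1) * dilation              # left padding preserves causality
    padded = [0.0] * pad + list(x)
    out = []
    for t in range(len(x)):
        # kernel[k-1] taps the current frame; kernel[0] the oldest one
        out.append(sum(kernel[i] * padded[t + i * dilation] for i in range(k)))
    return out
```

With `dilation=1` this is an ordinary causal convolution; stacking layers with dilations 1, 2, 4, ... is the standard way such frontends grow global context while each layer stays cheap and causal.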