Paper Title
ConvRNN-T: Convolutional Augmented Recurrent Neural Network Transducers for Streaming Speech Recognition
Paper Authors
Paper Abstract
The recurrent neural network transducer (RNN-T) is a prominent streaming end-to-end (E2E) ASR technology. In RNN-T, the acoustic encoder commonly consists of stacks of LSTMs. Very recently, as an alternative to LSTM layers, the Conformer architecture was introduced, where the encoder of RNN-T is replaced with a modified Transformer encoder composed of convolutional layers at the frontend and between attention layers. In this paper, we introduce a new streaming ASR model, Convolutional Augmented Recurrent Neural Network Transducers (ConvRNN-T), in which we augment the LSTM-based RNN-T with a novel convolutional frontend consisting of local and global context CNN encoders. ConvRNN-T takes advantage of causal 1-D convolutional layers, squeeze-and-excitation, dilation, and residual blocks to provide both global and local audio context representation to the LSTM layers. We show ConvRNN-T outperforms RNN-T, Conformer, and ContextNet on LibriSpeech and in-house data. In addition, ConvRNN-T offers less computational complexity compared to Conformer. ConvRNN-T's superior accuracy, along with its low footprint, makes it a promising candidate for on-device streaming ASR technologies.
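The causal, dilated 1-D convolutions named in the abstract are what keep the frontend streamable: each output frame may depend only on current and past input frames. The following is a minimal, hypothetical sketch of that building block in pure Python (the paper's actual model would use a deep-learning framework and learned kernels); the function name and left-padding scheme are our illustrative assumptions, not the authors' code.

```python
def causal_dilated_conv1d(x, kernel, dilation=1):
    """Causal dilated 1-D convolution over a sequence x.

    The output at time t depends only on inputs at times <= t
    (achieved by left-padding with zeros), so the layer can run
    frame-by-frame in a streaming recognizer. Taps are spaced
    `dilation` steps apart, widening the receptive field without
    adding parameters.
    """
    k = len(kernel)
    pad = (k - 1) * dilation              # left padding preserves causality
    padded = [0.0] * pad + list(x)
    out = []
    for t in range(len(x)):
        # kernel[k-1] taps the current frame; kernel[0] the oldest one
        out.append(sum(kernel[i] * padded[t + i * dilation] for i in range(k)))
    return out
```

With `dilation=1` this is an ordinary causal convolution; stacking layers with dilations 1, 2, 4, ... is the standard way such frontends grow global context while each layer stays cheap and causal.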