Paper Title
Improving RNN transducer with normalized jointer network
Paper Authors
Paper Abstract
Recurrent neural transducer (RNN-T) is a promising end-to-end (E2E) model in automatic speech recognition (ASR). It has shown superior performance compared to traditional hybrid ASR systems. However, training RNN-T from scratch is still challenging. We observe a huge gradient variance during RNN-T training and suspect it hurts the performance. In this work, we analyze the cause of the huge gradient variance in RNN-T training and propose a new \textit{normalized jointer network} to overcome it. We also propose to enhance the RNN-T network with a modified conformer encoder network and a transformer-XL predictor network to achieve the best performance. Experiments are conducted on the open 170-hour AISHELL-1 dataset and an industrial-scale 30000-hour Mandarin speech dataset. On AISHELL-1, our RNN-T system achieves state-of-the-art results on the streaming and non-streaming benchmarks, with CERs of 6.15\% and 5.37\%, respectively. We further compare our RNN-T system with a well-trained commercial hybrid system on the 30000-hour industrial audio data and obtain a 9\% relative improvement without pre-training or an external language model.
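To make the gradient-variance problem concrete: a standard RNN-T jointer pairs every encoder frame with every predictor step, so each encoder or predictor output accumulates gradients from an entire row or column of the T×U alignment grid, which can inflate gradient magnitudes. The sketch below shows only the conventional, un-normalized jointer (the broadcast-add plus `tanh` formulation is a common RNN-T convention, not this paper's normalized variant, whose details are not given in the abstract); all shapes and names are illustrative assumptions.

```python
import numpy as np

def jointer(enc, pred, W, b):
    """Conventional (un-normalized) RNN-T jointer network.

    enc:  (T, H) encoder outputs, one vector per acoustic frame
    pred: (U, H) predictor outputs, one vector per label step
    W:    (H, V) output projection, b: (V,) bias
    Returns logits over the (T, U, V) alignment grid.
    """
    # Broadcast-add pairs every frame t with every label step u, so in the
    # backward pass each enc[t] receives gradient from U grid cells and each
    # pred[u] from T cells -- the accumulation the paper's normalized
    # jointer is designed to tame.
    h = np.tanh(enc[:, None, :] + pred[None, :, :])  # (T, U, H)
    return h @ W + b                                  # (T, U, V)

# Toy dimensions: 50 frames, 10 label steps, hidden size 8, vocab size 16.
T, U, H, V = 50, 10, 8, 16
rng = np.random.default_rng(0)
enc = rng.standard_normal((T, H))
pred = rng.standard_normal((U, H))
W = rng.standard_normal((H, V)) * 0.1
b = np.zeros(V)
logits = jointer(enc, pred, W, b)
print(logits.shape)  # (50, 10, 16)
```

Note that the gradient fan-in per encoder frame grows with U (and per predictor step with T), so longer utterances and label sequences make the variance problem worse, which is consistent with the abstract's observation during training.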