Paper Title
Improving RNN transducer with normalized jointer network
Paper Authors
Paper Abstract
Recurrent neural transducer (RNN-T) is a promising end-to-end (E2E) model in automatic speech recognition (ASR). It has shown superior performance compared to traditional hybrid ASR systems. However, training RNN-T from scratch is still challenging. We observe a huge gradient variance during RNN-T training and suspect it hurts the performance. In this work, we analyze the cause of the huge gradient variance in RNN-T training and propose a new \textit{normalized jointer network} to overcome it. We also propose to enhance the RNN-T network with a modified conformer encoder network and a transformer-XL predictor network to achieve the best performance. Experiments are conducted on the open 170-hour AISHELL-1 dataset and an industrial-scale 30000-hour Mandarin speech dataset. On AISHELL-1, our RNN-T system achieves state-of-the-art results on the streaming and non-streaming benchmarks, with CERs of 6.15\% and 5.37\%, respectively. We further compare our RNN-T system with a well-trained commercial hybrid system on the 30000-hour industrial audio data and obtain a 9\% relative improvement without pre-training or an external language model.
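To make the gradient-variance problem concrete: a standard RNN-T jointer pairs every encoder frame with every predictor step, so each encoder or predictor output accumulates gradients from an entire row or column of the T×U alignment grid, which can inflate gradient magnitudes. The sketch below shows only the conventional, un-normalized jointer (the broadcast-add plus `tanh` formulation is a common RNN-T convention, not this paper's normalized variant, whose details are not given in the abstract); all shapes and names are illustrative assumptions.

```python
import numpy as np

def jointer(enc, pred, W, b):
    """Conventional (un-normalized) RNN-T jointer network.

    enc:  (T, H) encoder outputs, one vector per acoustic frame
    pred: (U, H) predictor outputs, one vector per label step
    W:    (H, V) output projection, b: (V,) bias
    Returns logits over the (T, U, V) alignment grid.
    """
    # Broadcast-add pairs every frame t with every label step u, so in the
    # backward pass each enc[t] receives gradient from U grid cells and each
    # pred[u] from T cells -- the accumulation the paper's normalized
    # jointer is designed to tame.
    h = np.tanh(enc[:, None, :] + pred[None, :, :])  # (T, U, H)
    return h @ W + b                                  # (T, U, V)

# Toy dimensions: 50 frames, 10 label steps, hidden size 8, vocab size 16.
T, U, H, V = 50, 10, 8, 16
rng = np.random.default_rng(0)
enc = rng.standard_normal((T, H))
pred = rng.standard_normal((U, H))
W = rng.standard_normal((H, V)) * 0.1
b = np.zeros(V)
logits = jointer(enc, pred, W, b)
print(logits.shape)  # (50, 10, 16)
```

Note that the gradient fan-in per encoder frame grows with U (and per predictor step with T), so longer utterances and label sequences make the variance problem worse, which is consistent with the abstract's observation during training.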