Faster, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces

Paper Title

Faster, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces

Authors

Frank Zhang, Yongqiang Wang, Xiaohui Zhang, Chunxi Liu, Yatharth Saraf, Geoffrey Zweig

Abstract

In this work, we first show that on the widely used LibriSpeech benchmark, our transformer-based context-dependent connectionist temporal classification (CTC) system produces state-of-the-art results. We then show that by using wordpieces as modeling units combined with CTC training, we can greatly simplify the engineering pipeline compared to conventional frame-based cross-entropy training by excluding all the GMM bootstrapping, decision tree building and forced alignment steps, while still achieving a very competitive word error rate. Additionally, using wordpieces as modeling units can significantly improve runtime efficiency, since we can use a larger stride without losing accuracy. We further confirm these findings on two internal VideoASR datasets: German, which is similar to English as a fusional language, and Turkish, which is an agglutinative language.
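
The link between modeling units and stride can be illustrated with a small sketch. It relies on a general property of CTC rather than on anything stated in the abstract: the subsampled output must contain at least one frame per label, plus one blank between adjacent identical labels. Wordpiece sequences are shorter than character sequences, so they tolerate a larger stride. The function names, the 60-frame utterance, and the wordpiece segmentation below are hypothetical and chosen only for illustration; they are not the authors' configuration.

```python
import math
from typing import Sequence


def num_output_frames(num_input_frames: int, stride: int) -> int:
    """Frames remaining after subsampling the input by `stride` (ceiling division)."""
    return math.ceil(num_input_frames / stride)


def min_ctc_frames(labels: Sequence[str]) -> int:
    """Minimum number of frames CTC needs to emit `labels`.

    One frame per label, plus one blank frame between each pair of
    adjacent identical labels (otherwise CTC would merge them).
    """
    repeats = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return len(labels) + repeats


def ctc_feasible(num_input_frames: int, stride: int, labels: Sequence[str]) -> bool:
    """True if the subsampled sequence is long enough for a valid CTC alignment."""
    return num_output_frames(num_input_frames, stride) >= min_ctc_frames(labels)


# Hypothetical utterance: 60 input frames, transcript "hello world".
chars = list("hello world")          # 11 character labels, one adjacent repeat ("ll")
pieces = ["▁hello", "▁world"]        # assumed wordpiece segmentation, 2 labels

for stride in (4, 8):
    print(stride, ctc_feasible(60, stride, chars), ctc_feasible(60, stride, pieces))
```

Under these assumed numbers, the character sequence fits at stride 4 but not at stride 8, while the wordpiece sequence fits at both. This length constraint is only one factor; the accuracy claims in the abstract rest on the authors' experiments.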

Paper Title