Paper Title

High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model

Authors

Jinyu Li, Rui Zhao, Eric Sun, Jeremy H. M. Wong, Amit Das, Zhong Meng, Yifan Gong

Abstract

While the community keeps promoting end-to-end models over conventional hybrid models, which usually are long short-term memory (LSTM) models trained with a cross entropy criterion followed by a sequence discriminative training criterion, we argue that such conventional hybrid models can still be significantly improved. In this paper, we detail our recent efforts to improve conventional hybrid LSTM acoustic models for high-accuracy and low-latency automatic speech recognition. To achieve high accuracy, we use a contextual layer trajectory LSTM (cltLSTM), which decouples the temporal modeling and target classification tasks, and incorporates future context frames to get more information for accurate acoustic modeling. We further improve the training strategy with sequence-level teacher-student learning. To obtain low latency, we design a two-head cltLSTM, in which one head has zero latency and the other head has a small latency, compared to an LSTM. When trained with Microsoft's 65 thousand hours of anonymized training data and evaluated with test sets with 1.8 million words, the proposed two-head cltLSTM model with the proposed training strategy yields a 28.2% relative WER reduction over the conventional LSTM acoustic model, with a similar perceived latency.
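To make the two-head latency trade-off concrete, here is a toy sketch (not the authors' implementation; all function names and the trivial "encoding" are hypothetical) of the key idea: at each frame, a zero-latency head emits an output using only past and current frames, while a lookahead head waits for a few future context frames, gaining information at the cost of a small delay.

```python
# Toy illustration of the two-head latency idea (hypothetical code,
# not the paper's cltLSTM): the "encoding" is just a sum over the
# visible frames, standing in for a real recurrent acoustic model.

def zero_latency_head(frames, t):
    """Emit an output at frame t using only frames 0..t (zero latency)."""
    return sum(frames[: t + 1])

def lookahead_head(frames, t, k=2):
    """Emit an output at frame t that also sees k future context frames,
    which adds k frames of latency but provides more information."""
    return sum(frames[: min(t + 1 + k, len(frames))])

frames = [1, 2, 3, 4, 5]
print(zero_latency_head(frames, 1))    # uses frames [1, 2]
print(lookahead_head(frames, 1, k=2))  # uses frames [1, 2, 3, 4]
```

In the paper's setting, the streaming output can be driven by the zero-latency head while the lookahead head refines recognition with a small, bounded delay, keeping the perceived latency close to that of a plain LSTM.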
