Paper Title

Three-Module Modeling For End-to-End Spoken Language Understanding Using Pre-trained DNN-HMM-Based Acoustic-Phonetic Model

Paper Authors

Wang, Nick J. C., Wang, Lu, Sun, Yandan, Kang, Haimei, Zhang, Dejun

Paper Abstract

In spoken language understanding (SLU), what the user says is converted into his/her intent. Recent work on end-to-end SLU has shown that accuracy can be improved via pre-training approaches. We revisit the ideas presented by Lugosch et al. using speech pre-training and three-module modeling; however, to ease construction of the end-to-end SLU model, we use as our phoneme module an open-source acoustic-phonetic model from a DNN-HMM hybrid automatic speech recognition (ASR) system instead of training one from scratch. Hence we fine-tune on speech only for the word module, and we apply multi-target learning (MTL) on the word and intent modules to jointly optimize SLU performance. MTL yields a relative reduction of 40% in intent-classification error rate (from 1.0% to 0.6%). Note that our three-module model is a streaming method. The proposed three-module modeling approach achieves a final intent accuracy of 99.4% on the Fluent Speech Commands dataset, a 50% reduction in intent error rate compared to that of Lugosch et al. Although we focus on real-time streaming methods, we also list non-streaming methods for comparison.
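To make the pipeline in the abstract concrete, below is a minimal PyTorch sketch of the three-module stack and the multi-target learning (MTL) objective. This is an illustrative reconstruction, not the authors' code: all layer sizes, the pooling choice, and the loss weight lam are assumptions; only the overall structure follows the abstract (a frozen pre-trained phoneme module, a word module fine-tuned on speech, an intent module, and a joint word+intent loss).

import torch
import torch.nn as nn

class ThreeModuleSLU(nn.Module):
    """Phoneme -> word -> intent stack, per the abstract (sizes are illustrative)."""

    def __init__(self, feat_dim=80, phone_dim=512, word_vocab=1000, num_intents=31):
        super().__init__()
        # Phoneme module: stands in for the open-source DNN-HMM
        # acoustic-phonetic model; frozen, since the paper reuses it
        # instead of training one from scratch.
        self.phoneme = nn.Sequential(
            nn.Linear(feat_dim, phone_dim), nn.ReLU(),
            nn.Linear(phone_dim, phone_dim), nn.ReLU(),
        )
        for p in self.phoneme.parameters():
            p.requires_grad = False
        # Word module: the only part fine-tuned on speech.
        # A unidirectional GRU keeps the stack streamable.
        self.word = nn.GRU(phone_dim, 256, batch_first=True)
        self.word_head = nn.Linear(256, word_vocab)
        # Intent module: classifies from the last word-module state,
        # so the prediction can be updated without future audio frames.
        self.intent_head = nn.Linear(256, num_intents)

    def forward(self, feats):                       # feats: (B, T, feat_dim)
        h = self.phoneme(feats)                     # (B, T, phone_dim)
        w, _ = self.word(h)                         # (B, T, 256)
        word_logits = self.word_head(w)             # per-frame word targets
        intent_logits = self.intent_head(w[:, -1])  # one label per utterance
        return word_logits, intent_logits

def mtl_loss(word_logits, word_targets, intent_logits, intent_targets, lam=0.5):
    # Multi-target learning: jointly optimize the word and intent
    # objectives; the weight lam is an illustrative assumption.
    ce = nn.CrossEntropyLoss()
    word_loss = ce(word_logits.flatten(0, 1), word_targets.flatten())
    intent_loss = ce(intent_logits, intent_targets)
    return lam * word_loss + (1.0 - lam) * intent_loss

# Example forward pass on random features (4 utterances, 100 frames each).
model = ThreeModuleSLU()
feats = torch.randn(4, 100, 80)
word_logits, intent_logits = model(feats)

The unidirectional word module and last-state intent readout reflect the streaming property the abstract emphasizes; a non-streaming variant would typically swap in a bidirectional encoder and whole-utterance pooling.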
