Paper Title
End-to-end speech-to-dialog-act recognition
Paper Authors
Paper Abstract
Spoken language understanding, which extracts intents and/or semantic concepts from utterances, is conventionally formulated as a post-processing step after automatic speech recognition. It is usually trained with oracle transcripts, but needs to deal with ASR errors. Moreover, some acoustic features are related to intents but are not represented in the transcripts. In this paper, we present an end-to-end model which directly converts speech into dialog acts without a deterministic transcription process. In the proposed model, the dialog act recognition network is connected to an acoustic-to-word ASR model at its latent layer before the softmax layer, which provides a distributed representation of word-level ASR decoding information. The entire network is then fine-tuned in an end-to-end manner. This allows for stable training as well as robustness against ASR errors. The model is further extended to conduct dialog act (DA) segmentation jointly. Evaluations on the Switchboard corpus demonstrate that the proposed method significantly improves dialog act recognition accuracy over the conventional pipeline framework.
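
For illustration, below is a minimal sketch of how such a model could be wired together, assuming a PyTorch implementation. The class and layer names (SpeechToDialogAct, da_classifier, etc.), layer types, and dimensions are illustrative assumptions based only on the abstract, not the authors' actual architecture; the key point shown is that the DA branch consumes the latent layer placed before the ASR softmax, so gradients from the DA loss flow back into the ASR encoder during end-to-end fine-tuning.

    # Minimal sketch (assumption: all architectural details beyond the abstract,
    # such as LSTM layers and dimensions, are illustrative only).
    import torch
    import torch.nn as nn

    class SpeechToDialogAct(nn.Module):
        """Acoustic-to-word ASR model whose pre-softmax latent layer feeds a DA classifier."""

        def __init__(self, feat_dim=80, hidden_dim=320, vocab_size=10000, num_dialog_acts=42):
            super().__init__()
            # Acoustic-to-word ASR branch: acoustic encoder + word-level output layer.
            self.encoder = nn.LSTM(feat_dim, hidden_dim, num_layers=2,
                                   batch_first=True, bidirectional=True)
            self.latent = nn.Linear(2 * hidden_dim, hidden_dim)    # latent layer before softmax
            self.word_logits = nn.Linear(hidden_dim, vocab_size)   # softmax layer of the A2W model
            # Dialog-act branch attached to the latent layer, which serves as a
            # distributed representation of word-level ASR decoding information.
            self.da_pool = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
            self.da_classifier = nn.Linear(hidden_dim, num_dialog_acts)

        def forward(self, feats):
            enc, _ = self.encoder(feats)              # (B, T, 2*hidden)
            latent = torch.tanh(self.latent(enc))     # (B, T, hidden), shared by both branches
            asr_logits = self.word_logits(latent)     # word posteriors for the ASR objective
            _, (h_n, _) = self.da_pool(latent)        # summarize the utterance
            da_logits = self.da_classifier(h_n[-1])   # (B, num_dialog_acts)
            return asr_logits, da_logits

    # Usage: pre-train the ASR branch, then fine-tune the whole network end-to-end
    # with the dialog-act loss (optionally combined with the ASR loss).
    model = SpeechToDialogAct()
    feats = torch.randn(4, 200, 80)                   # 4 utterances, 200 frames, 80-dim features
    asr_logits, da_logits = model(feats)
    da_loss = nn.CrossEntropyLoss()(da_logits, torch.randint(0, 42, (4,)))
    da_loss.backward()                                # gradients reach the ASR encoder (end-to-end)
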