Paper Title
Building Robust Spoken Language Understanding by Cross Attention between Phoneme Sequence and ASR Hypothesis

Paper Authors

Zexun Wang, Yuquan Le, Yi Zhu, Yuming Zhao, Mingchao Feng, Meng Chen, Xiaodong He

Paper Abstract
Building Spoken Language Understanding (SLU) robust to Automatic Speech Recognition (ASR) errors is an essential issue for various voice-enabled virtual assistants. Considering that most ASR errors are caused by phonetic confusion between similar-sounding expressions, intuitively, leveraging the phoneme sequence of speech can complement the ASR hypothesis and enhance the robustness of SLU. This paper proposes a novel model with Cross Attention for SLU (denoted as CASLU). The cross attention block is devised to capture the fine-grained interactions between phoneme and word embeddings, so that the joint representations capture the phonetic and semantic features of the input simultaneously and overcome ASR errors in downstream natural language understanding (NLU) tasks. Extensive experiments are conducted on three datasets, showing the effectiveness and competitiveness of our approach. Additionally, we validate the universality of CASLU and demonstrate its complementarity when combined with other robust SLU techniques.
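The fusion the abstract describes, where word embeddings attend to phoneme embeddings to form a joint phonetic-semantic representation, can be sketched as plain scaled dot-product cross attention. This is an illustrative NumPy version, not the paper's exact architecture; the function name, the query/key assignment (words as queries, phonemes as keys/values), and the concatenation-based fusion are all assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(word_emb, phoneme_emb):
    """Let each word attend over the phoneme sequence (illustrative sketch).

    word_emb:    (n_words, d)    embeddings of ASR-hypothesis tokens
    phoneme_emb: (n_phonemes, d) embeddings of the phoneme sequence
    Returns a joint representation of shape (n_words, 2*d): the original
    semantic features concatenated with attended phonetic features.
    """
    d = word_emb.shape[-1]
    scores = word_emb @ phoneme_emb.T / np.sqrt(d)   # (n_words, n_phonemes)
    weights = softmax(scores, axis=-1)               # rows sum to 1
    attended = weights @ phoneme_emb                 # (n_words, d)
    return np.concatenate([word_emb, attended], axis=-1)

rng = np.random.default_rng(0)
words = rng.normal(size=(5, 16))    # 5 word embeddings, dim 16
phones = rng.normal(size=(12, 16))  # 12 phoneme embeddings, dim 16
joint = cross_attention(words, phones)
print(joint.shape)  # (5, 32)
```

A real implementation would add learned query/key/value projections and run this inside a Transformer-style block, but the sketch shows the core idea: each word's representation is augmented with a phonetically weighted mixture of the phoneme sequence, which is what lets the downstream NLU task recover from sound-alike ASR substitutions.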