Paper Title

A Study of Different Ways to Use The Conformer Model For Spoken Language Understanding

Authors

Wang, Nick J. C., Wang, Shaojun, Xiao, Jing

Abstract


SLU combines ASR and NLU capabilities to accomplish speech-to-intent understanding. In this paper, we compare different ways to combine ASR and NLU, in particular using a single Conformer model with different ways to use its components, to better understand the strengths and weaknesses of each approach. We find that it is not necessarily a choice between two-stage decoding and end-to-end systems which determines the best system for research or application. System optimization still entails carefully improving the performance of each component. It is difficult to prove that one direction is conclusively better than the other. In this paper, we also propose a novel connectionist temporal summarization (CTS) method to reduce the length of acoustic encoding sequences while improving the accuracy and processing speed of end-to-end models. This method achieves the same intent accuracy as the best two-stage SLU recognition with complicated and time-consuming decoding but does so at lower computational cost. This stacked end-to-end SLU system yields an intent accuracy of 93.97% for the SmartLights far-field set, 95.18% for the close-field set, and 99.71% for FluentSpeech.
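The abstract does not spell out how connectionist temporal summarization (CTS) shortens the acoustic encoding sequence. A minimal sketch of one plausible realization — dropping encoder frames that a CTC head confidently labels as blank — is shown below; the function name, threshold, and frame-selection rule are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def ctc_blank_summarize(encodings, blank_logprobs, threshold=-0.5):
    """Drop encoder frames that a CTC head labels as blank with high
    confidence, shortening the sequence fed to a downstream decoder.

    encodings:      (T, D) array of acoustic encoder outputs
    blank_logprobs: (T,)   log-probability of the CTC blank at each frame
    threshold:      frames whose blank log-prob exceeds this are dropped

    NOTE: illustrative sketch only; selection by CTC blank posterior is
    one plausible reading of "connectionist temporal summarization",
    not the method as published.
    """
    keep = blank_logprobs < threshold  # keep frames unlikely to be blank
    return encodings[keep]

# Toy example: 6 encoder frames of dimension 4.
rng = np.random.default_rng(0)
enc = rng.standard_normal((6, 4))
blank_lp = np.array([-0.1, -2.0, -0.05, -1.5, -3.0, -0.2])
short = ctc_blank_summarize(enc, blank_lp)
print(short.shape)  # sequence shortened from 6 frames to 3
```

Because the downstream intent classifier attends over far fewer frames, such summarization can cut computation roughly in proportion to the fraction of frames removed, which matches the paper's claim of lower computational cost.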
