Paper Title

Introducing Semantics into Speech Encoders

Paper Authors

Derek Xu, Shuyan Dong, Changhan Wang, Suyoun Kim, Zhaojiang Lin, Akshat Shrivastava, Shang-Wen Li, Liang-Hsuan Tseng, Alexei Baevski, Guan-Ting Lin, Hung-yi Lee, Yizhou Sun, Wei Wang

Paper Abstract

Recent studies find that existing self-supervised speech encoders contain primarily acoustic rather than semantic information. As a result, pipelined supervised automatic speech recognition (ASR) to large language model (LLM) systems achieve state-of-the-art results on semantic spoken language tasks by utilizing rich semantic representations from the LLM. These systems come at the cost of labeled audio transcriptions, which are expensive and time-consuming to obtain. We propose a task-agnostic unsupervised way of incorporating semantic information from LLMs into self-supervised speech encoders without labeled audio transcriptions. By introducing semantics, we improve existing speech encoder spoken language understanding performance by over 10% on intent classification, with modest gains in named entity resolution and slot filling, and improve spoken question answering FF1 score by over 2%. Our unsupervised approach achieves performance similar to that of supervised methods trained on over 100 hours of labeled audio transcripts, demonstrating the feasibility of unsupervised semantic augmentation of existing speech encoders.
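The abstract does not spell out the training objective, but one common realization of "incorporating semantic information from LLMs into a speech encoder without labeled transcriptions" is a distillation loss that pulls pooled speech-encoder features toward LLM embeddings of machine-generated (e.g., unsupervised-ASR) transcripts. The sketch below illustrates that idea only; the `SemanticProjection` module, tensor shapes, and the cosine-distance loss are illustrative assumptions, not the paper's exact method.

```python
# A minimal sketch (assumed, not the authors' exact method) of unsupervised
# semantic distillation: align pooled speech-encoder features with LLM
# sentence embeddings via a learned projection and a cosine-distance loss.
# Random tensors stand in for real model outputs here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticProjection(nn.Module):
    """Projects frame-level speech features into the LLM embedding space."""
    def __init__(self, speech_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(speech_dim, llm_dim)

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, frames, speech_dim), e.g. from a frozen
        # self-supervised encoder such as HuBERT.
        pooled = speech_feats.mean(dim=1)   # mean-pool over time: (batch, speech_dim)
        return self.proj(pooled)            # (batch, llm_dim)

def distillation_loss(speech_emb: torch.Tensor, llm_emb: torch.Tensor) -> torch.Tensor:
    """Cosine-distance loss pulling speech embeddings toward LLM embeddings."""
    return (1.0 - F.cosine_similarity(speech_emb, llm_emb, dim=-1)).mean()

# Toy usage. Dimensions are illustrative assumptions.
batch, frames, speech_dim, llm_dim = 4, 200, 768, 1024
speech_feats = torch.randn(batch, frames, speech_dim)  # speech encoder output
llm_emb = torch.randn(batch, llm_dim)                  # LLM embedding of the
                                                       # machine-generated transcript
head = SemanticProjection(speech_dim, llm_dim)
loss = distillation_loss(head(speech_feats), llm_emb)
loss.backward()
print(f"distillation loss: {loss.item():.4f}")
```

In practice the speech features would come from a pretrained self-supervised encoder and the targets from an LLM run on unsupervised ASR output, so the pipeline stays free of labeled audio transcriptions, as the abstract describes.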
