Paper Title
Introducing Semantics into Speech Encoders
Paper Authors
Paper Abstract
Recent studies find that existing self-supervised speech encoders contain primarily acoustic rather than semantic information. As a result, pipelined systems that feed supervised automatic speech recognition (ASR) output into large language models (LLMs) achieve state-of-the-art results on semantic spoken language tasks by exploiting the LLM's rich semantic representations. These systems come at the cost of labeled audio transcriptions, which are expensive and time-consuming to obtain. We propose a task-agnostic, unsupervised way of incorporating semantic information from LLMs into self-supervised speech encoders without labeled audio transcriptions. By introducing semantics, we improve the spoken language understanding performance of existing speech encoders by over 10\% on intent classification, obtain modest gains on named entity resolution and slot filling, and raise spoken question answering FF1 scores by over 2\%. Our unsupervised approach achieves performance comparable to supervised methods trained on over 100 hours of labeled audio transcripts, demonstrating the feasibility of unsupervised semantic augmentation of existing speech encoders.
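As a rough illustration of what incorporating LLM semantics into a speech encoder could look like, the following PyTorch sketch aligns pooled speech-encoder representations with a frozen LLM's utterance embedding through a learned projection and a cosine-similarity loss. This is a minimal sketch under stated assumptions: the module names, dimensions, pooling scheme, and choice of loss are all illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: distilling a frozen LLM's semantic embedding into a
# speech encoder via an alignment loss. Names and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAlignmentHead(nn.Module):
    """Projects speech-encoder frames into the LLM embedding space."""
    def __init__(self, speech_dim: int = 768, llm_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(speech_dim, llm_dim)

    def forward(self, speech_frames: torch.Tensor) -> torch.Tensor:
        # Mean-pool frame-level features into one utterance vector, project it.
        pooled = speech_frames.mean(dim=1)  # (batch, speech_dim)
        return self.proj(pooled)            # (batch, llm_dim)

def alignment_loss(speech_vec: torch.Tensor, llm_vec: torch.Tensor) -> torch.Tensor:
    # Cosine-distance distillation target: pull the projected speech
    # representation toward the (frozen) LLM's semantic embedding.
    return 1.0 - F.cosine_similarity(speech_vec, llm_vec, dim=-1).mean()

# Toy usage with random tensors standing in for real model outputs.
head = SemanticAlignmentHead()
speech_frames = torch.randn(4, 200, 768)  # stand-in for speech encoder output
llm_vec = torch.randn(4, 1024)            # stand-in for frozen LLM embedding
loss = alignment_loss(head(speech_frames), llm_vec)
loss.backward()
```

In such a setup, gradients flow only through the projection head (and optionally the speech encoder), while the LLM stays frozen, which is consistent with the abstract's claim of adding semantics without labeled audio transcriptions.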