Paper Title
Bridging Speech and Textual Pre-trained Models with Unsupervised ASR
Paper Authors
Paper Abstract
Spoken language understanding (SLU) is a task that aims to extract high-level semantics from spoken utterances. Previous works have investigated the use of speech self-supervised models and textual pre-trained models, which have shown reasonable improvements on various SLU tasks. However, because of the modality mismatch between speech signals and text tokens, previous methods usually require complex framework designs. This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models, resulting in an unsupervised speech-to-semantic pre-trained model for various SLU tasks. Specifically, we propose to use unsupervised automatic speech recognition (ASR) as a connector that bridges the different modalities used in speech and textual pre-trained models. Our experiments show that unsupervised ASR itself can improve the representations from speech self-supervised models. More importantly, it is shown to be an efficient connector between speech and textual pre-trained models, improving the performance of five different SLU tasks. Notably, on spoken question answering, we achieve state-of-the-art results on the challenging NMSQA benchmark.
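To make the connection the abstract describes concrete, here is a minimal conceptual sketch of the pipeline: a speech self-supervised encoder produces frame-level features, an unsupervised ASR module converts them into a (pseudo-)transcript without paired supervision, and a textual pre-trained model extracts semantics for a downstream SLU head. All type aliases and function names below are illustrative placeholders, not the authors' implementation.

```python
# Hypothetical sketch of the speech-to-semantic pipeline, assuming three
# pre-existing components; names are illustrative placeholders.
from typing import Callable, List, Sequence

# Assumed component interfaces (placeholders):
SpeechEncoder = Callable[[Sequence[float]], List[List[float]]]  # raw audio -> SSL frame features
UnsupervisedASR = Callable[[List[List[float]]], str]            # frame features -> pseudo transcript
TextEncoder = Callable[[str], List[float]]                      # text -> semantic representation
SLUHead = Callable[[List[float]], str]                          # semantics -> task prediction


def speech_to_semantics(
    waveform: Sequence[float],
    speech_encoder: SpeechEncoder,
    unsupervised_asr: UnsupervisedASR,
    text_encoder: TextEncoder,
    slu_head: SLUHead,
) -> str:
    """Bridge speech and textual pre-trained models via unsupervised ASR.

    The unsupervised ASR step maps continuous speech representations to
    (pseudo-)text, so the textual pre-trained model can consume them
    without any paired speech-text supervision.
    """
    frames = speech_encoder(waveform)       # speech self-supervised model
    transcript = unsupervised_asr(frames)   # modality connector (trained without labels)
    semantics = text_encoder(transcript)    # textual pre-trained model
    return slu_head(semantics)              # downstream SLU task (e.g., intent, spoken QA)
```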