Paper Title

Joint Encoder-Decoder Self-Supervised Pre-training for ASR

Authors

A. Arunkumar, S. Umesh

Abstract

Self-supervised learning (SSL) has shown tremendous success in various speech-related downstream tasks, including Automatic Speech Recognition (ASR). The output embeddings of the SSL model are treated as powerful short-time representations of the speech signal. However, in the ASR task, the main objective is to obtain the correct sequence of acoustic units, characters, or byte-pair encodings (BPEs). Usually, an encoder-decoder architecture works exceptionally well for a sequence-to-sequence task like ASR. Therefore, in this paper, we propose a new paradigm that exploits the power of a decoder during self-supervised learning. We use the Hidden Unit BERT (HuBERT) SSL framework to compute the conventional masked prediction loss for the encoder. In addition, we introduce a decoder into the SSL framework and propose a target preparation strategy for the decoder. Finally, we use a multitask SSL setup wherein we jointly optimize both the encoder and decoder losses. We hypothesize that the presence of a decoder in the SSL model helps it learn an acoustic unit-based language model, which might improve the performance of an ASR downstream task. We compare our proposed SSL model with HuBERT and show up to 25% relative improvement in ASR performance by fine-tuning on various LibriSpeech subsets.
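The multitask objective described in the abstract — a HuBERT-style masked prediction loss on the encoder combined with a sequence prediction loss on the decoder — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the interpolation weight `alpha`, the helper names, and the tensor shapes are all assumptions.

```python
import numpy as np

def cross_entropy(logits, targets):
    # Numerically stable softmax cross-entropy, averaged over positions.
    # logits: (N, V) array of unnormalized scores; targets: (N,) unit indices.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def joint_ssl_loss(enc_logits, enc_targets, mask, dec_logits, dec_targets,
                   alpha=0.5):
    # Encoder branch: HuBERT-style masked prediction loss, computed only
    # over the masked frames of the input (boolean `mask` selects them).
    l_enc = cross_entropy(enc_logits[mask], enc_targets[mask])
    # Decoder branch: prediction loss over the pseudo-target unit sequence
    # produced by the target preparation strategy.
    l_dec = cross_entropy(dec_logits, dec_targets)
    # Multitask objective: weighted sum of the two losses (alpha is an
    # assumed hyperparameter, not a value reported by the paper).
    return alpha * l_enc + (1.0 - alpha) * l_dec
```

A usage sketch with random frame-level and sequence-level logits shows the two branches being optimized jointly through a single scalar objective; in a real training loop the gradients of this sum would update the shared encoder as well as the decoder.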
