Paper Title

Joint Encoder-Decoder Self-Supervised Pre-training for ASR

Authors

A. Arunkumar, S. Umesh

Abstract

Self-supervised learning (SSL) has shown tremendous success in various speech-related downstream tasks, including Automatic Speech Recognition (ASR). The output embeddings of the SSL model are treated as powerful short-time representations of the speech signal. However, in the ASR task, the main objective is to obtain the correct sequence of acoustic units, characters, or byte-pair encodings (BPEs). Usually, an encoder-decoder architecture works exceptionally well for a sequence-to-sequence task like ASR. Therefore, in this paper, we propose a new paradigm that exploits the power of a decoder during self-supervised learning. We use the Hidden Unit BERT (HuBERT) SSL framework to compute the conventional masked prediction loss for the encoder. In addition, we introduce a decoder into the SSL framework and propose a target preparation strategy for the decoder. Finally, we use a multitask SSL setup wherein we jointly optimize both the encoder and decoder losses. We hypothesize that the presence of a decoder in the SSL model helps it learn an acoustic unit-based language model, which might improve the performance of an ASR downstream task. We compare our proposed SSL model with HuBERT and show up to 25% relative improvement in ASR performance by fine-tuning on various LibriSpeech subsets.
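The multitask objective described in the abstract — a HuBERT-style masked prediction loss on the encoder combined with a sequence prediction loss on the decoder — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the interpolation weight `alpha`, the helper names, and the tensor shapes are all assumptions.

```python
import numpy as np

def cross_entropy(logits, targets):
    # Numerically stable softmax cross-entropy, averaged over positions.
    # logits: (N, V) array of unnormalized scores; targets: (N,) unit indices.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def joint_ssl_loss(enc_logits, enc_targets, mask, dec_logits, dec_targets,
                   alpha=0.5):
    # Encoder branch: HuBERT-style masked prediction loss, computed only
    # over the masked frames of the input (boolean `mask` selects them).
    l_enc = cross_entropy(enc_logits[mask], enc_targets[mask])
    # Decoder branch: prediction loss over the pseudo-target unit sequence
    # produced by the target preparation strategy.
    l_dec = cross_entropy(dec_logits, dec_targets)
    # Multitask objective: weighted sum of the two losses (alpha is an
    # assumed hyperparameter, not a value reported by the paper).
    return alpha * l_enc + (1.0 - alpha) * l_dec
```

A usage sketch with random frame-level and sequence-level logits shows the two branches being optimized jointly through a single scalar objective; in a real training loop the gradients of this sum would update the shared encoder as well as the decoder.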
