Paper Title

Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling

Paper Authors

Puyuan Peng, David Harwath

Paper Abstract

In this paper, we describe our submissions to the ZeroSpeech 2021 Challenge and SUPERB benchmark. Our submissions are based on the recently proposed FaST-VGS model, which is a Transformer-based model that learns to associate raw speech waveforms with semantically related images, all without the use of any transcriptions of the speech. Additionally, we introduce a novel extension of this model, FaST-VGS+, which is learned in a multi-task fashion with a masked language modeling objective in addition to the visual grounding objective. On ZeroSpeech 2021, we show that our models perform competitively on the ABX task, outperform all other concurrent submissions on the Syntactic and Semantic tasks, and nearly match the best system on the Lexical task. On the SUPERB benchmark, we show that our models also achieve strong performance, in some cases even outperforming the popular wav2vec2.0 model.
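
The abstract describes FaST-VGS+ as being trained in a multi-task fashion, combining a visual grounding objective with a masked language modeling objective. The following PyTorch sketch is only an illustration of that idea, not the authors' implementation: the model class, the GRU stand-in for the paper's Transformer encoders, the temperature of 0.07, the vocabulary size, and the 15% masking rate are all assumptions made for the example.

```python
# Illustrative sketch (not the FaST-VGS+ code): a shared speech encoder trained
# with a contrastive visual-grounding loss plus an MLM-style loss on masked frames.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMultiTaskModel(nn.Module):
    def __init__(self, dim=256, vocab=512):
        super().__init__()
        # stand-ins for the speech and image encoders (Transformers in the paper)
        self.speech_enc = nn.GRU(dim, dim, batch_first=True)
        self.image_proj = nn.Linear(dim, dim)
        # head predicting discrete targets for masked frames (MLM-style objective)
        self.mlm_head = nn.Linear(dim, vocab)

    def forward(self, speech_feats, image_feats):
        speech_repr, _ = self.speech_enc(speech_feats)   # (B, T, dim)
        speech_emb = speech_repr.mean(dim=1)             # pooled utterance embedding
        image_emb = self.image_proj(image_feats)         # (B, dim)
        return speech_repr, speech_emb, image_emb

def grounding_loss(speech_emb, image_emb, temp=0.07):
    # symmetric InfoNCE-style loss pairing each utterance with its own image
    s = F.normalize(speech_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = s @ v.t() / temp
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def mlm_loss(model, speech_repr, targets, mask):
    # cross-entropy computed only on the masked positions
    logits = model.mlm_head(speech_repr)                 # (B, T, vocab)
    return F.cross_entropy(logits[mask], targets[mask])

# toy usage with random tensors standing in for real speech/image features
B, T, dim, vocab = 4, 50, 256, 512
model = ToyMultiTaskModel(dim, vocab)
speech = torch.randn(B, T, dim)
images = torch.randn(B, dim)
targets = torch.randint(0, vocab, (B, T))
mask = torch.rand(B, T) < 0.15                           # ~15% of frames masked

repr_, s_emb, v_emb = model(speech, images)
loss = grounding_loss(s_emb, v_emb) + mlm_loss(model, repr_, targets, mask)
loss.backward()
```

The sketch only shows how the two objectives can be summed into a single training loss; in the actual models the objectives share a Transformer speech encoder and are combined as described in the paper.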
