Paper Title
VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning
Paper Authors
Paper Abstract
Although speech is a simple and effective way for humans to communicate with the outside world, more realistic speech interaction contains multimodal information, e.g., vision and text. How to design a unified framework that integrates different modal information and leverages different resources (e.g., visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to facilitate speech representation learning has not been well explored. In this paper, we propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model). The proposed VATLM employs a unified backbone network to model modality-independent information and utilizes three simple modality-dependent modules to preprocess visual, speech, and text inputs. To integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task over unified tokens, given by our proposed unified tokenizer. We evaluate the pre-trained VATLM on audio-visual downstream tasks, including audio-visual speech recognition (AVSR) and visual speech recognition (VSR). Results show that the proposed VATLM outperforms previous state-of-the-art models, such as the audio-visual pre-trained AV-HuBERT model, and analysis also demonstrates that VATLM is capable of aligning different modalities into the same space. To facilitate future research, we release the code and pre-trained models at https://aka.ms/vatlm.
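To make the described architecture concrete, below is a minimal, hypothetical PyTorch sketch of the idea stated in the abstract: three modality-dependent front-ends feed a shared, modality-independent backbone, which predicts unified tokens at each position (the targets for the masked prediction objective). All module names, feature dimensions, and the unified-token vocabulary size here are illustrative assumptions, not the released VATLM implementation.

```python
import torch
import torch.nn as nn


class VATLMSketch(nn.Module):
    """Hypothetical sketch: modality-dependent front-ends + shared backbone + unified-token head."""

    def __init__(self, hidden_dim=768, num_layers=12, num_heads=12, vocab_size=2000):
        super().__init__()
        # Modality-dependent preprocessing modules (assumed input feature sizes).
        self.audio_frontend = nn.Linear(80, hidden_dim)             # e.g., log-mel frames -> hidden
        self.visual_frontend = nn.Linear(512, hidden_dim)           # e.g., lip-ROI embeddings -> hidden
        self.text_frontend = nn.Embedding(vocab_size, hidden_dim)   # text/phoneme ids -> hidden
        # Modality-independent shared backbone.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Prediction head over the unified token vocabulary (masked prediction targets).
        self.unified_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, audio=None, video=None, text=None):
        # Embed whichever modalities are present and concatenate along the time axis.
        streams = []
        if audio is not None:
            streams.append(self.audio_frontend(audio))
        if video is not None:
            streams.append(self.visual_frontend(video))
        if text is not None:
            streams.append(self.text_frontend(text))
        x = torch.cat(streams, dim=1)
        # During pre-training, a subset of input frames would be masked here (omitted for brevity).
        h = self.backbone(x)
        return self.unified_head(h)  # logits over unified tokens at each position


# Usage example: a batch with 10 audio frames and 5 video frames mapped to unified-token logits.
model = VATLMSketch()
logits = model(audio=torch.randn(2, 10, 80), video=torch.randn(2, 5, 512))
print(logits.shape)  # torch.Size([2, 15, 2000])
```

The key design point reflected here is that only the thin front-ends are modality-specific; the backbone and the unified-token prediction head are shared, which is what allows visual, audio, and text inputs to be mapped into one semantic space.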