Paper Title


M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval

Paper Authors

Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-yi Lee, David Harwath

Paper Abstract


This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval. For non-English image-speech retrieval, we outperform the current state-of-the-art performance by a wide margin both when training separate models for each language, and with a single model which processes speech in all three languages. We identify key differences in model behavior and performance between English and non-English settings, attributable to the English-only pre-training of CLIP and HuBERT, and investigate how fine-tuning the pre-trained models impacts these differences. Finally, we show that our models can be used for mono- and cross-lingual speech-text retrieval and cross-lingual speech-speech retrieval, despite never having seen any parallel speech-text or speech-speech data during training.
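The abstract describes retrieval across modalities (image-speech, speech-text, speech-speech) in a shared embedding space. As a rough illustration of how such retrieval works mechanically, the sketch below ranks candidate image embeddings against one speech embedding by cosine similarity. The embedding dimension, the random stand-in vectors, and the candidate count are all hypothetical placeholders, not the paper's actual encoders or data.

```python
import numpy as np

# Hypothetical sketch: cross-modal retrieval in a shared embedding space.
# Real systems would produce image_embs from an image encoder (e.g. CLIP)
# and speech_emb from a speech encoder (e.g. HuBERT-based); here both are
# random stand-ins so the ranking mechanics are self-contained.

rng = np.random.default_rng(0)
EMB_DIM = 512  # illustrative dimension, not taken from the paper

def l2_normalize(x):
    """Scale vectors to unit length so dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

image_embs = l2_normalize(rng.standard_normal((100, EMB_DIM)))  # 100 candidates
speech_emb = l2_normalize(rng.standard_normal(EMB_DIM))         # one spoken query

# Cosine similarity reduces to a matrix-vector product after normalization.
scores = image_embs @ speech_emb

# Retrieve the top-k candidates; Recall@k checks if the true match is among them.
k = 10
topk = np.argsort(-scores)[:k]
print("top-10 candidate indices:", topk)
```

The same ranking step applies unchanged to speech-text or speech-speech retrieval: only the encoders producing the two sides of the similarity change.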
