Paper Title

Grafting Pre-trained Models for Multimodal Headline Generation

Authors

Lingfeng Qiao, Chen Wu, Ye Liu, Haoyuan Peng, Di Yin, Bo Ren

Abstract

Multimodal headline generation utilizes both video frames and transcripts to generate the natural language title of a video. Because annotating grounded headlines for video is labor-intensive and impractical, large-scale, manually annotated data is scarce. Previous research on pre-trained language models and video-language models has achieved significant progress in related downstream tasks. However, none of them can be directly applied to a multimodal headline architecture, where both a multimodal encoder and a sentence decoder are needed. A major challenge in simply gluing a language model and a video-language model together is modality balance, which aims to combine their complementary visual and language abilities. In this paper, we propose a novel approach that grafts the video encoder from a pre-trained video-language model onto a generative pre-trained language model. We also present a consensus fusion mechanism that integrates the different components via inter- and intra-modality relations. Empirically, experiments show that the grafted model achieves strong results on a brand-new dataset collected from real-world applications.
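
The abstract describes grafting a pre-trained video encoder onto a generative language model and fusing the two streams with an inter/intra-modality "consensus" step. The following is a minimal PyTorch sketch of that general idea only; the module names, dimensions, layer counts, and the fusion design are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn


class ConsensusFusion(nn.Module):
    """Illustrative fusion block: intra-modality self-attention per stream,
    then inter-modality cross-attention from text to video (assumed design)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.intra_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.intra_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_feats: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # Intra-modality self-attention refines each stream independently.
        t, _ = self.intra_text(text_feats, text_feats, text_feats)
        v, _ = self.intra_video(video_feats, video_feats, video_feats)
        # Inter-modality cross-attention lets transcript tokens attend to frames.
        fused, _ = self.inter(t, v, v)
        # Residual keeps the textual signal, a crude stand-in for modality balance.
        return t + fused


class GraftedHeadlineModel(nn.Module):
    """Toy encoder-decoder: a (stand-in) video encoder grafted onto a
    (stand-in) generative language model; in practice both would be loaded
    from pre-trained checkpoints."""

    def __init__(self, vocab_size: int = 32000, dim: int = 512):
        super().__init__()
        self.video_encoder = nn.Sequential(nn.Linear(2048, dim), nn.GELU())
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.fusion = ConsensusFusion(dim)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, transcript_ids, frame_feats, headline_ids):
        text = self.text_encoder(self.text_embed(transcript_ids))
        video = self.video_encoder(frame_feats)
        memory = self.fusion(text, video)          # fused multimodal memory
        tgt = self.text_embed(headline_ids)        # headline tokens so far
        out = self.decoder(tgt, memory)
        return self.lm_head(out)                   # next-token logits


if __name__ == "__main__":
    model = GraftedHeadlineModel()
    transcript = torch.randint(0, 32000, (2, 64))  # transcript token ids
    frames = torch.randn(2, 16, 2048)              # 16 frame features per video
    headline = torch.randint(0, 32000, (2, 12))    # partial headline tokens
    print(model(transcript, frames, headline).shape)  # torch.Size([2, 12, 32000])
```

The point of the sketch is the wiring: the video encoder and language model are separate pre-trained pieces, and only the fusion step ties them together before decoding.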
