Paper Title

All in One: Exploring Unified Video-Language Pre-training

Authors

Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, Mike Zheng Shou

Abstract

Mainstream Video-Language Pre-training models \cite{actbert,clipbert,violet} consist of three parts: a video encoder, a text encoder, and a video-text fusion Transformer. They pursue better performance by utilizing heavier unimodal encoders or multimodal fusion Transformers, resulting in increased parameters and lower efficiency in downstream tasks. In this work, we for the first time introduce an end-to-end video-language model, namely the \textit{all-in-one Transformer}, that embeds raw video and textual signals into joint representations using a unified backbone architecture. We argue that the unique temporal information of video data turns out to be a key barrier hindering the design of a modality-agnostic Transformer. To overcome the challenge, we introduce a novel and effective token rolling operation to encode temporal representations from video clips in a non-parametric manner. This careful design enables representation learning of both video-text multimodal inputs and unimodal inputs using a unified backbone model. Our pre-trained all-in-one Transformer is transferred to various downstream video-text tasks after fine-tuning, including text-video retrieval, video question answering, multiple-choice and visual commonsense reasoning. State-of-the-art performance with minimal model FLOPs on nine datasets demonstrates the superiority of our method compared to competitive counterparts. The code and pretrained model have been released at https://github.com/showlab/all-in-one.
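To make the token rolling idea concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' released implementation; see the GitHub repository for that). It assumes video tokens shaped (batch, frames, tokens, channels) and circularly shifts a fraction of each frame's tokens one step along the temporal axis, so frames exchange information with no added parameters; the function name and the `roll_fraction` parameter are illustrative choices.

```python
import torch

def temporal_token_rolling(x: torch.Tensor, roll_fraction: float = 0.25) -> torch.Tensor:
    """Hypothetical sketch of a non-parametric temporal token rolling op.

    x: (B, T, N, C) = batch, frames, tokens per frame, channels.
    Rolls the first `roll_fraction` of each frame's tokens one step
    along the temporal axis so frames mix information without any
    learnable parameters.
    """
    B, T, N, C = x.shape
    k = int(N * roll_fraction)  # how many tokens per frame to roll
    out = x.clone()
    # Circular shift of the selected tokens across the frame dimension.
    out[:, :, :k, :] = torch.roll(x[:, :, :k, :], shifts=1, dims=1)
    return out

# Tiny usage example: shapes are preserved, only token order across time changes.
clip = torch.randn(2, 3, 8, 16)  # 2 clips, 3 frames, 8 tokens, dim 16
mixed = temporal_token_rolling(clip)
assert mixed.shape == clip.shape
```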
