LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling

Paper Title

LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling

Authors

Li, Linjie; Gan, Zhe; Lin, Kevin; Lin, Chung-Ching; Liu, Zicheng; Liu, Ce; Wang, Lijuan

Abstract

Unified vision-language frameworks have greatly advanced in recent years, most of which adopt an encoder-decoder architecture to unify image-text tasks as sequence-to-sequence generation. However, existing video-language (VidL) models still require task-specific designs in model architecture and training objectives for each task. In this work, we explore a unified VidL framework LAVENDER, where Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks. Such unification leads to a simplified model architecture, where only a lightweight MLM head, instead of a decoder with much more parameters, is needed on top of the multimodal encoder. Surprisingly, experimental results show that this unified framework achieves competitive performance on 14 VidL benchmarks, covering video question answering, text-to-video retrieval and video captioning. Extensive analyses further demonstrate the advantage of LAVENDER over existing VidL methods in: (i) supporting all downstream tasks with just a single set of parameter values when multi-task finetuned; (ii) few-shot generalization on various downstream tasks; and (iii) enabling zero-shot evaluation on video question answering tasks. Code is available at https://github.com/microsoft/LAVENDER.
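
To make the idea of MLM as a common interface concrete, the following is a minimal illustrative sketch, not the authors' implementation (see the repository linked above for that). It assumes each task can be posed by appending a mask token to the text and reading a word off at the masked position. The names `build_input` and `read_mask`, the whitespace tokenization, and the hand-written score dictionaries are all hypothetical stand-ins for a real tokenizer, multimodal encoder, and MLM head.

```python
from typing import Dict, List, Optional

MASK = "[MASK]"  # assumed mask token, as in BERT-style vocabularies


def build_input(text: str) -> List[str]:
    """Split text on whitespace and append one mask token (hypothetical helper)."""
    return text.split() + [MASK]


def read_mask(scores: Dict[str, float], allowed: Optional[List[str]] = None) -> str:
    """Return the highest-scoring word at the masked position.

    `scores` stands in for the output of an MLM head. If `allowed` is given,
    the choice is restricted to those words.
    """
    candidates = allowed if allowed is not None else list(scores)
    return max(candidates, key=lambda w: scores.get(w, float("-inf")))


# Video question answering: the answer word is predicted at the mask.
qa_tokens = build_input("what is the man holding ?")
qa_answer = read_mask({"guitar": 0.8, "dog": 0.1, "true": 0.05})

# Text-to-video retrieval: a match decision is read off as a word (assumed here
# to be "true" or "false") at the mask, so no separate matching head is needed.
match = read_mask({"guitar": 0.9, "true": 0.3, "false": 0.6}, allowed=["true", "false"])

print(qa_tokens[-1], qa_answer, match)
```

Under these assumptions, every task reduces to the same call pattern: format the text with a mask, score the vocabulary at that position, and pick a word. This is the sense in which a single lightweight head can replace task-specific output layers.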
