Title
Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training
Authors
Abstract
In this work, we present Auto-captions on GIF, a new large-scale pre-training dataset for generic video understanding. All video-sentence pairs are created by automatically extracting and filtering video caption annotations from billions of web pages. The Auto-captions on GIF dataset can be used to pre-train generic feature representations or encoder-decoder structures for video captioning, as well as for other downstream tasks (e.g., sentence localization in videos, video question answering). We present a detailed analysis of the Auto-captions on GIF dataset in comparison to existing video-sentence datasets. We also provide an evaluation of a Transformer-based encoder-decoder structure for vision-language pre-training, which is further adapted to the video captioning downstream task and yields compelling generalizability on MSR-VTT. The dataset is available at \url{http://www.auto-video-captions.top/2020/dataset}.
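The abstract describes the pairs as being built by automatically extracting and filtering video caption annotations from web pages. The exact pipeline and filtering rules are not given here, so the following is only a minimal illustrative sketch: it scans HTML for `<img>` tags pointing at GIF files, takes the `alt` text as a caption candidate, and keeps it only if it passes simple plausibility heuristics (a length bound and a mostly-alphabetic ratio). All of these heuristics and thresholds are assumptions, not the paper's actual rules.

```python
from html.parser import HTMLParser

class GifCaptionExtractor(HTMLParser):
    """Collect (gif_url, caption) candidates from a web page.

    Illustrative only: the real Auto-captions on GIF pipeline is not
    specified in the abstract; the filters below are assumed heuristics.
    """
    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        a = dict(attrs)
        src = a.get("src", "")
        alt = (a.get("alt") or "").strip()
        # Keep only GIFs whose alt text looks like a natural sentence.
        if src.lower().endswith(".gif") and self._plausible(alt):
            self.pairs.append((src, alt))

    @staticmethod
    def _plausible(caption):
        words = caption.split()
        if not (3 <= len(words) <= 20):   # discard too-short/too-long text
            return False
        alpha = sum(c.isalpha() or c.isspace() for c in caption)
        return alpha / len(caption) > 0.8  # mostly letters, not IDs/markup

page = ('<img src="a.gif" alt="a dog runs on the beach">'
        '<img src="b.gif" alt="IMG_0001">'
        '<img src="c.png" alt="a cat sleeps on the sofa">')
extractor = GifCaptionExtractor()
extractor.feed(page)
print(extractor.pairs)  # only the GIF with a sentence-like caption survives
```

In this toy example, `b.gif` is rejected because its alt text is a filename-like token rather than a sentence, and `c.png` is rejected for not being a GIF; a real pipeline would add many more filters (language ID, deduplication, vision-text consistency checks).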