Paper Title
Learning to Answer Visual Questions from Web Videos
Paper Authors
Paper Abstract
Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and the VideoQA feature probe evaluation setting and show excellent results, in particular for rare answers. Furthermore, our method achieves competitive results on MSRVTT-QA, ActivityNet-QA, MSVD-QA and How2QA datasets. We also show that our VideoQA dataset generation approach generalizes to another source of web video and text data. We use our method to generate the WebVidVQA3M dataset from the WebVid dataset, i.e., videos with alt-text annotations, and show its benefits for training VideoQA models. Finally, for a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language bias and high-quality manual annotations. Code, datasets and trained models are available at https://antoyang.github.io/just-ask.html
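To make the training procedure mentioned in the abstract more concrete, below is a minimal sketch (not the authors' released code) of a contrastive objective between a video-question multi-modal transformer and an answer transformer: each video-question embedding is matched against its own answer embedding, with the other answers in the batch serving as negatives. The encoders are replaced by random tensors, and the dimensions, temperature value, and function names are illustrative assumptions.

```python
# Minimal sketch of an InfoNCE-style contrastive loss between video-question
# embeddings and answer embeddings. Random tensors stand in for the outputs
# of the two transformers; all sizes and names are illustrative assumptions.
import torch
import torch.nn.functional as F


def contrastive_loss(vq_emb: torch.Tensor, ans_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """The i-th video-question embedding should be most similar to the
    i-th answer embedding within the batch; other answers are negatives."""
    vq_emb = F.normalize(vq_emb, dim=-1)
    ans_emb = F.normalize(ans_emb, dim=-1)
    logits = vq_emb @ ans_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)


# Toy usage: random vectors play the role of the two transformers' outputs.
batch_size, dim = 8, 256
vq = torch.randn(batch_size, dim, requires_grad=True)    # video-question embeddings
ans = torch.randn(batch_size, dim, requires_grad=True)   # answer embeddings
loss = contrastive_loss(vq, ans)
loss.backward()
print(f"contrastive loss: {loss.item():.4f}")
```

Because the answer side is embedded independently of the video-question side, such a formulation can in principle score an arbitrary, open vocabulary of candidate answers at test time, which is consistent with the open-vocabulary and zero-shot settings described in the abstract.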