Video Question Answering with Iterative Video-Text Co-Tokenization

Paper Title

Video Question Answering with Iterative Video-Text Co-Tokenization

Authors

AJ Piergiovanni, Kairo Morton, Weicheng Kuo, Michael S. Ryoo, Anelia Angelova

Abstract

Video question answering is a challenging task that requires understanding jointly the language input, the visual information in individual video frames, as well as the temporal information about the events occurring in the video. In this paper, we propose a novel multi-stream video encoder for video question answering that uses multiple video inputs and a new video-text iterative co-tokenization approach to answer a variety of questions related to videos. We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, IVQA, outperforming the previous state-of-the-art by large margins. Simultaneously, our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
