Paper Title

Watching the News: Towards VideoQA Models that can Read

Authors

Soumya Jahagirdar, Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar

Abstract

Video Question Answering methods focus on commonsense reasoning and visual cognition of objects or persons and their interactions over time. Current VideoQA approaches ignore the textual information present in the video. Instead, we argue that textual information is complementary to the action and provides essential contextualisation cues to the reasoning process. To this end, we propose a novel VideoQA task that requires reading and understanding the text in the video. To explore this direction, we focus on news videos and require QA systems to comprehend and answer questions about the topics presented by combining visual and textual cues in the video. We introduce the "NewsVideoQA" dataset that comprises more than 8,600 QA pairs on 3,000+ news videos obtained from diverse news channels from around the world. We demonstrate the limitations of current Scene Text VQA and VideoQA methods and propose ways to incorporate scene text information into VideoQA methods.
