Paper Title

An experimental study of the vision-bottleneck in VQA

Paper Authors

Pierre Marza, Corentin Kervadec, Grigory Antipov, Moez Baccouche, Christian Wolf

Paper Abstract

As in many tasks combining vision and language, both modalities play a crucial role in Visual Question Answering (VQA). To properly solve the task, a given model should understand both the content of the proposed image and the nature of the question. While the fusion between modalities, which is another obviously important part of the problem, has been studied extensively, the vision part has received less attention in recent work. Current state-of-the-art methods for VQA mainly rely on off-the-shelf object detectors delivering a set of object bounding boxes and embeddings, which are then combined with question word embeddings through a reasoning module. In this paper, we propose an in-depth study of the vision-bottleneck in VQA, experimenting with both the quantity and quality of visual objects extracted from images. We also study the impact of two methods of incorporating the information about objects necessary for answering a question: directly in the reasoning module, and earlier, in the object selection stage. This work highlights the importance of vision in the context of VQA, and the benefit of tailoring the vision methods used in VQA to the task at hand.
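
To make the pipeline described in the abstract concrete, the sketch below shows a minimal PyTorch implementation of this generic VQA setup: pre-extracted object features (e.g. from an off-the-shelf detector) are fused with question word embeddings by a small question-guided attention "reasoning" module. All names, dimensions, and the fusion mechanism are illustrative assumptions, not the specific model studied in the paper.

```python
# A minimal sketch (not the paper's model) of the generic VQA pipeline the
# abstract describes: object features from an off-the-shelf detector are fused
# with question word embeddings by a question-guided attention reasoning module.
# Dimensions, answer vocabulary size, and the fusion scheme are assumptions.
import torch
import torch.nn as nn


class SimpleVQAReasoner(nn.Module):
    def __init__(self, obj_dim=2048, word_dim=300, hidden=512, num_answers=3000):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, hidden)                   # project detector embeddings
        self.q_encoder = nn.GRU(word_dim, hidden, batch_first=True)  # encode the question
        self.attn = nn.Linear(hidden * 2, 1)                         # question-guided attention over objects
        self.classifier = nn.Sequential(
            nn.Linear(hidden * 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_answers),                          # answer classification head
        )

    def forward(self, obj_feats, word_embs):
        # obj_feats: (B, N, obj_dim) bounding-box embeddings from the detector
        # word_embs: (B, T, word_dim) question word embeddings
        objs = self.obj_proj(obj_feats)                               # (B, N, H)
        _, q = self.q_encoder(word_embs)                              # final hidden state (1, B, H)
        q = q.squeeze(0)                                              # (B, H)
        q_tiled = q.unsqueeze(1).expand(-1, objs.size(1), -1)         # (B, N, H)
        scores = self.attn(torch.cat([objs, q_tiled], dim=-1))        # (B, N, 1)
        weights = torch.softmax(scores, dim=1)                        # attention over objects
        pooled = (weights * objs).sum(dim=1)                          # (B, H) attended visual summary
        return self.classifier(torch.cat([pooled, q], dim=-1))        # (B, num_answers) answer logits


# Example: a batch of 2 images with 36 detected objects each, 14-word questions.
model = SimpleVQAReasoner()
logits = model(torch.randn(2, 36, 2048), torch.randn(2, 14, 300))
print(logits.shape)  # torch.Size([2, 3000])
```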
