Paper Title
Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering
Paper Authors
Paper Abstract
Most Outside-Knowledge Visual Question Answering (OK-VQA) systems employ a two-stage framework that first retrieves external knowledge given the visual question and then predicts the answer based on the retrieved content. However, the retrieved knowledge is often inadequate. Retrievals are frequently too general and fail to cover the specific knowledge needed to answer the question. Also, the naturally available supervision (whether the passage contains the correct answer) is weak and does not guarantee question relevance. To address these issues, we propose an Entity-Focused Retrieval (EnFoRe) model that provides stronger supervision during training and recognizes question-relevant entities to help retrieve more specific knowledge. Experiments show that our EnFoRe model achieves superior retrieval performance on OK-VQA, currently the largest outside-knowledge VQA dataset. We also combine the retrieved knowledge with state-of-the-art VQA models and achieve a new state-of-the-art performance on OK-VQA.