Paper Title
Answer-checking in Context: A Multi-modal Fully Attention Network for Visual Question Answering
Paper Authors
Paper Abstract
Visual Question Answering (VQA) is challenging due to complex cross-modal relations, and it has received extensive attention from the research community. From a human perspective, answering a visual question requires reading the question and then referring to the image to generate an answer. The answer is then checked against the question and image again for final confirmation. In this paper, we mimic this process and propose a fully attention-based VQA architecture. Moreover, an answer-checking module is proposed that performs unified attention on the joint answer, question, and image representation to update the answer. This mimics the human answer-checking process of considering the answer in context. With answer-checking modules and transferred BERT layers, our model achieves state-of-the-art accuracy of 71.57\% on the VQA-v2.0 test-standard split while using fewer parameters.
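The answer-checking idea described above can be sketched as a single self-attention pass over the concatenated answer, question, and image representations, from which the updated answer vector is read back. The following NumPy sketch is only an illustration of that unified-attention step; all dimensions, weight matrices, and the single-head formulation are hypothetical and not the paper's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def unified_attention(answer, question, image, W_q, W_k, W_v):
    """Single-head self-attention over the joint [answer; question; image] sequence.

    answer:   (1, d)  candidate-answer representation
    question: (Lq, d) question token representations
    image:    (Lv, d) image region representations
    Returns the updated answer vector (the attended output at the answer position).
    """
    x = np.concatenate([answer, question, image], axis=0)  # (1 + Lq + Lv, d)
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # scaled dot-product attention
    out = softmax(scores) @ V                # every position attends across all modalities
    return out[0]                            # updated answer representation

# Toy example with made-up sizes
rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
ans = rng.standard_normal((1, d))    # one answer vector
ques = rng.standard_normal((5, d))   # 5 question tokens
img = rng.standard_normal((10, d))   # 10 image regions
updated = unified_attention(ans, ques, img, W_q, W_k, W_v)
print(updated.shape)  # (8,)
```

Because the answer position attends over the question and image positions in the same sequence, its updated representation is conditioned on both modalities, which is the "checking the answer in context" intuition the abstract describes.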