Paper Title

An Improved Attention for Visual Question Answering

Paper Authors

Tanzila Rahman, Shih-Han Chou, Leonid Sigal, Giuseppe Carenini

Paper Abstract

We consider the problem of Visual Question Answering (VQA). Given an image and a free-form, open-ended question expressed in natural language, the goal of a VQA system is to provide an accurate answer to this question with respect to the image. The task is challenging because it requires simultaneous and intricate understanding of both visual and textual information. Attention, which captures intra- and inter-modal dependencies, has emerged as perhaps the most widely used mechanism for addressing these challenges. In this paper, we propose an improved attention-based architecture to solve VQA. We incorporate an Attention on Attention (AoA) module within an encoder-decoder framework, which is able to determine the relation between attention results and queries. A standard attention module generates only a weighted average for each query. The AoA module, in contrast, first generates an information vector and an attention gate from the attention results and the current context, and then applies a second attention by multiplying the two to produce the final attended information. We also propose a multimodal fusion module to combine visual and textual information. The goal of this fusion module is to dynamically decide how much information should be considered from each modality. Extensive experiments on the VQA-v2 benchmark dataset show that our method achieves state-of-the-art performance.
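As a rough illustration of the mechanism described in the abstract, the PyTorch sketch below implements the AoA gating (an information vector modulated by a sigmoid attention gate computed from the attention result and the query context) and a simple gated fusion of visual and textual features. The class names, feature dimensions, and the particular gated-sum fusion formula are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class AoA(nn.Module):
    """Attention on Attention sketch: gates the attention result with the query context."""

    def __init__(self, dim):
        super().__init__()
        self.info = nn.Linear(2 * dim, dim)  # produces the information vector
        self.gate = nn.Linear(2 * dim, dim)  # produces the attention gate

    def forward(self, att_result, query):
        # Concatenate the attention result with the current context (the query).
        x = torch.cat([att_result, query], dim=-1)
        i = self.info(x)                 # information vector
        g = torch.sigmoid(self.gate(x))  # attention gate in (0, 1)
        return g * i                     # final attended information


class GatedFusion(nn.Module):
    """Assumed gated multimodal fusion: dynamically weighs visual vs. textual features."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, visual, textual):
        a = torch.sigmoid(self.gate(torch.cat([visual, textual], dim=-1)))
        return a * visual + (1.0 - a) * textual


# Toy usage with hypothetical shapes: batch of 8, 36 image regions, 512-d features.
v = torch.randn(8, 36, 512)   # attended visual features
q = torch.randn(8, 36, 512)   # question-conditioned queries aligned to the regions
attended = AoA(512)(v, q)
fused = GatedFusion(512)(attended.mean(dim=1), q.mean(dim=1))
print(attended.shape, fused.shape)  # torch.Size([8, 36, 512]) torch.Size([8, 512])
```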
