Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar Instances

Paper Title

Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar Instances

Authors

Wu, Yike; Zhao, Yu; Zhao, Shiwan; Zhang, Ying; Yuan, Xiaojie; Zhao, Guoqing; Jiang, Ning

Abstract

Despite the great progress of Visual Question Answering (VQA), current VQA models rely heavily on the superficial correlation between the question type and its corresponding frequent answers (i.e., language priors) to make predictions, without really understanding the input. In this work, we define training instances with the same question type but different answers as superficially similar instances, and attribute the language priors to the confusion of the VQA model on such instances. To solve this problem, we propose a novel training framework that explicitly encourages the VQA model to distinguish between superficially similar instances. Specifically, for each training instance, we first construct a set that contains its superficially similar counterparts. Then we exploit the proposed distinguishing module to increase the distance between the instance and its counterparts in the answer space. In this way, the VQA model is forced to focus on the parts of the input beyond the question type, which helps to overcome the language priors. Experimental results show that our method achieves state-of-the-art performance on VQA-CP v2. Code is available at https://github.com/wyk-nku/Distinguishing-VQA.git (Distinguishing-VQA).
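
The two steps described in the abstract (building the set of superficially similar counterparts, then pushing an instance away from them in the answer space) can be illustrated with a minimal sketch. This is not the authors' implementation: the field names `question_type` and `answer`, the helper names, the use of Euclidean distance, and the hinge-style margin penalty are all assumptions made for illustration only. The abstract does not specify the distance or the loss used by the distinguishing module.

```python
from collections import defaultdict
import math


def build_counterpart_sets(instances):
    """Map each instance index to the indices of its counterparts:
    instances with the same question type but a different answer.
    The dictionary keys 'question_type' and 'answer' are hypothetical."""
    by_type = defaultdict(list)
    for idx, inst in enumerate(instances):
        by_type[inst["question_type"]].append(idx)

    counterparts = {}
    for idx, inst in enumerate(instances):
        counterparts[idx] = [
            j for j in by_type[inst["question_type"]]
            if instances[j]["answer"] != inst["answer"]
        ]
    return counterparts


def euclidean(p, q):
    """Euclidean distance between two answer-space vectors (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distinguishing_penalty(pred, counterpart_preds, margin=1.0):
    """Assumed hinge-style penalty: zero once every counterpart prediction
    is at least `margin` away from `pred`, positive otherwise."""
    if not counterpart_preds:
        return 0.0
    total = sum(max(0.0, margin - euclidean(pred, c)) for c in counterpart_preds)
    return total / len(counterpart_preds)


# Toy usage: two "what color" questions with different answers are counterparts.
instances = [
    {"question_type": "what color", "answer": "white"},
    {"question_type": "what color", "answer": "red"},
    {"question_type": "what color", "answer": "white"},
    {"question_type": "how many", "answer": "2"},
]
counterpart_sets = build_counterpart_sets(instances)
```

In this toy data, instance 0 has instance 1 as its only counterpart, while instance 3 has none because no other instance shares its question type. Minimizing a penalty of this kind during training would discourage the model from producing near-identical answer distributions for instances that differ only beyond the question type, which is the intuition the abstract describes.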
