Paper Title

Estimating semantic structure for the VQA answer space

Paper Authors

Corentin Kervadec, Grigory Antipov, Moez Baccouche, Christian Wolf

Paper Abstract

Since its appearance, Visual Question Answering (VQA, i.e. answering a question posed over an image) has always been treated as a classification problem over a set of predefined answers. Despite its convenience, this classification approach poorly reflects the semantics of the problem, limiting answering to a choice between independent proposals without taking into account the similarity between them (e.g. equally penalizing the answers cat or German shepherd instead of dog). We address this issue by proposing (1) two measures of proximity between VQA classes, and (2) a corresponding loss which takes the estimated proximity into account. This significantly improves the generalization of VQA models by reducing their language bias. In particular, we show that our approach is completely model-agnostic, since it allows consistent improvements with three different VQA models. Finally, by combining our method with a language-bias reduction approach, we report SOTA-level performance on the challenging VQAv2-CP dataset.
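
The paper itself defines the proximity measures and the loss; as a rough illustration of the general idea (replacing one-hot classification targets with soft targets derived from answer-to-answer proximity, so that cat is penalized less than car when the ground truth is dog), here is a minimal PyTorch sketch. The similarity matrix, the temperature parameter, and both function names are assumptions made for this illustration, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def proximity_soft_targets(labels, similarity, temperature=0.1):
    """Turn hard answer indices into soft target distributions.

    labels:     (batch,) ground-truth answer indices
    similarity: (num_classes, num_classes) proximity scores, e.g. cosine
                similarity between answer word embeddings (an assumption
                here; the paper proposes two specific proximity measures)
    """
    # Pick the similarity row of each ground-truth answer, sharpen it
    # with a temperature, and normalize it into a distribution.
    rows = similarity[labels]                    # (batch, num_classes)
    return F.softmax(rows / temperature, dim=-1)

def proximity_aware_loss(logits, labels, similarity, temperature=0.1):
    """Cross-entropy against proximity-derived soft targets: near-misses
    in answer space receive a smaller penalty than distant answers."""
    targets = proximity_soft_targets(labels, similarity, temperature)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()

# Toy usage: 3 answers {dog, cat, car}; cat is close to dog, car is not.
if __name__ == "__main__":
    sim = torch.tensor([[1.0, 0.8, 0.1],
                        [0.8, 1.0, 0.1],
                        [0.1, 0.1, 1.0]])
    logits = torch.randn(4, 3, requires_grad=True)
    labels = torch.tensor([0, 1, 2, 0])
    loss = proximity_aware_loss(logits, labels, sim)
    loss.backward()
    print(loss.item())
```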
