Paper Title

Fine-Grained Visual Entailment

Authors

Christopher Thomas, Yipeng Zhang, Shih-Fu Chang

Abstract

Visual entailment is a recently proposed multimodal reasoning task where the goal is to predict the logical relationship of a piece of text to an image. In this paper, we propose an extension of this task, where the goal is to predict the logical relationship of fine-grained knowledge elements within a piece of text to an image. Unlike prior work, our method is inherently explainable and makes logical predictions at different levels of granularity. Because we lack fine-grained labels to train our method, we propose a novel multi-instance learning approach which learns a fine-grained labeling using only sample-level supervision. We also impose novel semantic structural constraints which ensure that fine-grained predictions are internally semantically consistent. We evaluate our method on a new dataset of manually annotated knowledge elements and show that our method achieves 68.18% accuracy at this challenging task while significantly outperforming several strong baselines. Finally, we present extensive qualitative results illustrating our method's predictions and the visual evidence our method relied on. Our code and annotated dataset can be found here: https://github.com/SkrighYZ/FGVE.
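The abstract describes the core training idea at a high level: fine-grained knowledge elements within the text are scored individually, but only a sample-level entailment label is available, so a multi-instance learning (MIL) objective aggregates element-level predictions into a sample-level one. The sketch below illustrates that general idea in PyTorch; the module name, feature dimension, and max-pooling aggregation are illustrative assumptions rather than the authors' actual architecture (see the linked repository for the real implementation).

```python
# Minimal MIL sketch: element-level entailment scores are pooled into a
# sample-level prediction, so training can use sample-level labels only.
# All names and the pooling choice here are illustrative assumptions.
import torch
import torch.nn as nn


class MILEntailmentHead(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int = 3):
        super().__init__()
        # Scores each knowledge element as entailment / neutral / contradiction.
        self.element_classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, element_feats: torch.Tensor):
        # element_feats: (num_elements, feat_dim) multimodal features,
        # one row per knowledge element extracted from the text.
        element_logits = self.element_classifier(element_feats)
        # Aggregate element-level logits into one sample-level prediction.
        # Max pooling is a common MIL pooling choice; the paper's aggregation
        # and semantic consistency constraints may differ.
        sample_logits, _ = element_logits.max(dim=0)
        return element_logits, sample_logits


# Training uses only the sample-level label; element-level predictions
# are learned implicitly through the pooled loss.
head = MILEntailmentHead(feat_dim=768)
feats = torch.randn(5, 768)  # 5 knowledge elements (dummy features)
element_logits, sample_logits = head(feats)
loss = nn.functional.cross_entropy(sample_logits.unsqueeze(0),
                                   torch.tensor([0]))  # sample-level label
loss.backward()
```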
