Paper Title
Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations
Paper Authors
Paper Abstract
We propose a margin-based loss for tuning joint vision-language models so that their gradient-based explanations are consistent with region-level annotations provided by humans for relatively small grounding datasets. We refer to this objective as Attention Mask Consistency (AMC) and demonstrate that it produces superior visual grounding results compared to previous methods that rely on using vision-language models to score the outputs of object detectors. In particular, a model trained with AMC on top of standard vision-language modeling objectives obtains a state-of-the-art accuracy of 86.49% on the Flickr30k visual grounding benchmark, an absolute improvement of 5.38% over the best previous model trained under the same level of supervision. Our approach also performs exceedingly well on established benchmarks for referring expression comprehension, where it obtains 80.34% accuracy on the easy split of RefCOCO+ and 64.55% on the difficult split. AMC is effective, easy to implement, and general: it can be adopted by any vision-language model and can use any type of region annotation.
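The core idea of the margin-based objective can be sketched as follows. This is an illustrative simplification, not the paper's exact formulation: it assumes a saliency heatmap obtained from a gradient-based explanation method (e.g. Grad-CAM) and a binary mask for the human-annotated region, and penalizes the model whenever the strongest activation outside the region comes within a margin of the strongest activation inside it.

```python
import torch

def amc_margin_loss(heatmap: torch.Tensor,
                    region_mask: torch.Tensor,
                    margin: float = 0.5) -> torch.Tensor:
    """Hypothetical sketch of an attention-mask-consistency hinge loss.

    heatmap:     (H, W) non-negative saliency map, e.g. from Grad-CAM
    region_mask: (H, W) binary mask, 1 inside the annotated region
    margin:      how far the peak outside the region should trail
                 the peak inside it
    """
    inside = heatmap[region_mask.bool()]
    outside = heatmap[~region_mask.bool()]
    # Hinge: zero loss when the peak inside the region exceeds the
    # peak outside by at least `margin`; positive otherwise.
    return torch.relu(outside.max() - inside.max() + margin)
```

In practice such a term would be added, with a weighting coefficient, to the model's standard vision-language pretraining losses, so the explanation is shaped toward the annotated region without replacing the original objectives.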