Paper Title
Contrastive Learning for Weakly Supervised Phrase Grounding
Paper Authors
Paper Abstract
Phrase grounding, the problem of associating image regions to caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on mutual information between images and caption words. Given pairs of images and captions, we maximize compatibility of the attention-weighted regions and the words in the corresponding caption, compared to non-corresponding pairs of images and captions. A key idea is to construct effective negative captions for learning through language model guided word substitutions. Training with our negatives yields a $\sim10\%$ absolute gain in accuracy over randomly-sampled negatives from the training data. Our weakly supervised phrase grounding model trained on COCO-Captions shows a healthy gain of $5.7\%$ to achieve $76.7\%$ accuracy on the Flickr30K Entities benchmark.
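The abstract only sketches the objective at a high level. A minimal illustrative sketch of the idea is given below, assuming simple dot-product word-region attention and an InfoNCE-style contrastive bound with in-batch negatives; the function names (`word_region_attention_scores`, `info_nce_loss`) are hypothetical, and the paper's actual contribution replaces the in-batch negatives with negative captions generated via language-model-guided word substitutions.

```python
import torch
import torch.nn.functional as F

def word_region_attention_scores(word_feats, region_feats):
    """Compatibility scores between captions and images via word-region attention.

    word_feats:   (B, T, D) word features for B captions of T words each
    region_feats: (B, R, D) region features for B images with R regions each
    Returns a (B_captions, B_images) score matrix.
    """
    B, T, D = word_feats.shape
    scores = torch.empty(B, B)
    for i in range(B):          # caption i
        for j in range(B):      # image j
            # each word in caption i attends over the regions of image j
            attn = torch.softmax(
                word_feats[i] @ region_feats[j].T / D ** 0.5, dim=-1)  # (T, R)
            attended = attn @ region_feats[j]                          # (T, D)
            # compatibility: mean dot product between words and their
            # attention-weighted region representations
            scores[i, j] = (word_feats[i] * attended).sum(-1).mean()
    return scores

def info_nce_loss(scores):
    """InfoNCE-style lower bound on mutual information.

    Matched (caption, image) pairs lie on the diagonal; all other images in
    the batch act as negatives for each caption.
    """
    targets = torch.arange(scores.size(0))
    return F.cross_entropy(scores, targets)

# Usage sketch with random features (batch of 4, 12 words, 36 regions, dim 256):
words = torch.randn(4, 12, 256)
regions = torch.randn(4, 36, 256)
loss = info_nce_loss(word_region_attention_scores(words, regions))
loss.backward() if words.requires_grad else None
```

In the paper's setting, the rows competing with the true caption would instead be its language-model-perturbed negative versions, which the abstract reports improves accuracy by roughly 10% absolute over randomly sampled negatives.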