Paper Title

ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension

Paper Authors

Sanjay Subramanian, William Merrill, Trevor Darrell, Matt Gardner, Sameer Singh, Anna Rohrbach

Paper Abstract

Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain. While large-scale pre-trained models are useful for image classification across domains, it remains unclear if they can be applied in a zero-shot manner to more complex tasks like ReC. We present ReCLIP, a simple but strong zero-shot baseline that repurposes CLIP, a state-of-the-art large-scale model, for ReC. Motivated by the close connection between ReC and CLIP's contrastive pre-training objective, the first component of ReCLIP is a region-scoring method that isolates object proposals via cropping and blurring, and passes them to CLIP. However, through controlled experiments on a synthetic dataset, we find that CLIP is largely incapable of performing spatial reasoning off-the-shelf. Thus, the second component of ReCLIP is a spatial relation resolver that handles several types of spatial relations. We reduce the gap between zero-shot baselines from prior work and supervised models by as much as 29% on RefCOCOg, and on RefGTA (video game imagery), ReCLIP's relative improvement over supervised ReC models trained on real images is 8%.
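To make the abstract's two components concrete, below is a minimal sketch of ReCLIP-style region scoring, assuming the Hugging Face `transformers` CLIP interface. The helper names (`isolate`, `score_proposals`, `left_of`), the model checkpoint, the blur radius, and the score ensembling are illustrative assumptions, not the paper's exact implementation (which also uses prompt templates and additional post-processing).

```python
# A minimal sketch of ReCLIP-style region scoring, assuming Hugging Face's
# CLIP implementation. Helper names and parameters are illustrative.
import torch
from PIL import Image, ImageDraw, ImageFilter
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def isolate(image, box):
    """Isolate one object proposal in two ways: (1) crop the box out,
    and (2) keep the full image but Gaussian-blur everything outside
    the box (blur radius here is an arbitrary illustrative choice)."""
    crop = image.crop(box)
    blurred = image.filter(ImageFilter.GaussianBlur(radius=10))
    mask = Image.new("L", image.size, 0)
    ImageDraw.Draw(mask).rectangle(box, fill=255)  # keep the box sharp
    blur_view = Image.composite(image, blurred, mask)
    return crop, blur_view

def score_proposals(image, boxes, expression):
    """Score each proposal box against the referring expression with CLIP,
    averaging the crop view and the blur view, and return the best box."""
    scores = []
    for box in boxes:
        views = list(isolate(image, box))
        inputs = processor(text=[expression], images=views,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = model(**inputs).logits_per_image  # (2 views, 1 text)
        scores.append(logits.mean().item())
    return boxes[int(torch.tensor(scores).argmax())]

def left_of(box_a, box_b):
    """Toy spatial predicate of the kind a rule-based relation resolver
    might combine with CLIP scores: compare horizontal box centers."""
    return (box_a[0] + box_a[2]) / 2 < (box_b[0] + box_b[2]) / 2
```

In the paper, the spatial relation resolver parses the expression into objects and relations and combines per-proposal CLIP probabilities with spatial predicates of roughly this kind; the `left_of` function above only gestures at that idea rather than reproducing the paper's parser or its full set of relations.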
