Paper Title
Adapting CLIP For Phrase Localization Without Further Training
Paper Authors
Paper Abstract
Supervised or weakly supervised methods for phrase localization (textual grounding) rely either on human annotations or on other supervised models, e.g., object detectors. Obtaining these annotations is labor-intensive and may be difficult to scale in practice. We propose to leverage recent advances in contrastive language-vision models, namely CLIP, pre-trained on image and caption pairs collected from the internet. In its original form, CLIP only outputs an image-level embedding without any spatial resolution. We adapt CLIP to generate high-resolution spatial feature maps. Importantly, we can extract feature maps from both ViT and ResNet CLIP models while maintaining the semantic properties of the image embedding. This provides a natural framework for phrase localization. Our method for phrase localization requires no human annotations or additional training. Extensive experiments show that our method outperforms existing no-training methods in zero-shot phrase localization, and in some cases, it even outperforms supervised methods. Code is available at https://github.com/pals-ttic/adapting-CLIP.
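To make the idea concrete, below is a minimal sketch of scoring CLIP ViT patch tokens against a phrase embedding to obtain a coarse localization heatmap. It is not the paper's released implementation (which also supports ResNet backbones and produces higher-resolution feature maps); it assumes OpenAI's clip package and PyTorch, relies on that implementation's attribute names (visual.transformer, ln_post, proj), and uses a placeholder image path and phrase.

```python
# Hedged sketch: compare CLIP ViT patch tokens with a phrase embedding.
# Assumes OpenAI's `clip` package (https://github.com/openai/CLIP) and PyTorch.
import torch
import torch.nn.functional as F
from PIL import Image
import clip

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()

captured = {}

def hook(module, inputs, output):
    # The ViT transformer outputs (sequence, batch, width); store as (batch, sequence, width).
    captured["tokens"] = output.permute(1, 0, 2)

handle = model.visual.transformer.register_forward_hook(hook)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical image path
text = clip.tokenize(["a dog on the grass"]).to(device)                # hypothetical phrase

with torch.no_grad():
    _ = model.encode_image(image)        # runs the visual tower and triggers the hook
    text_feat = model.encode_text(text)  # (1, embed_dim)

    tokens = captured["tokens"][:, 1:, :]                       # drop the CLS token
    # Apply the same post-norm and projection CLIP uses for the CLS token,
    # so patch tokens land in the joint image-text embedding space.
    tokens = model.visual.ln_post(tokens) @ model.visual.proj   # (1, n_patches, embed_dim)

    tokens = F.normalize(tokens, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    sim = (tokens @ text_feat.T).squeeze(-1)                    # (1, n_patches) cosine scores

    grid = int(sim.shape[1] ** 0.5)                             # ViT-B/32 at 224px -> 7x7 grid
    heatmap = sim.reshape(1, 1, grid, grid)
    heatmap = F.interpolate(heatmap, size=(224, 224),
                            mode="bilinear", align_corners=False)

handle.remove()
print(heatmap.shape)  # torch.Size([1, 1, 224, 224])
```

The key step mirrors the abstract's point about preserving semantic properties: the patch tokens are passed through the same ln_post and projection that CLIP applies to its image-level embedding, so they can be compared directly with the phrase embedding; the paper's adaptation goes further to produce high-resolution maps from both ViT and ResNet backbones.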