Paper Title
Adapting CLIP For Phrase Localization Without Further Training
Paper Authors
Paper Abstract
Supervised or weakly supervised methods for phrase localization (textual grounding) rely either on human annotations or on other supervised models, e.g., object detectors. Obtaining these annotations is labor-intensive and may be difficult to scale in practice. We propose to leverage recent advances in contrastive language-vision models, namely CLIP, pre-trained on image and caption pairs collected from the internet. In its original form, CLIP only outputs an image-level embedding without any spatial resolution. We adapt CLIP to generate high-resolution spatial feature maps. Importantly, we can extract feature maps from both ViT and ResNet CLIP models while maintaining the semantic properties of the image embedding. This provides a natural framework for phrase localization. Our method for phrase localization requires no human annotations or additional training. Extensive experiments show that our method outperforms existing no-training methods in zero-shot phrase localization, and in some cases, it even outperforms supervised methods. Code is available at https://github.com/pals-ttic/adapting-CLIP.
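To make the idea concrete, below is a minimal sketch of scoring CLIP ViT patch tokens against a phrase embedding to obtain a coarse localization heatmap. It is not the paper's released implementation (which also supports ResNet backbones and produces higher-resolution feature maps); it assumes OpenAI's clip package and PyTorch, relies on that implementation's attribute names (visual.transformer, ln_post, proj), and uses a placeholder image path and phrase.

```python
# Hedged sketch: compare CLIP ViT patch tokens with a phrase embedding.
# Assumes OpenAI's `clip` package (https://github.com/openai/CLIP) and PyTorch.
import torch
import torch.nn.functional as F
from PIL import Image
import clip

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()

captured = {}

def hook(module, inputs, output):
    # The ViT transformer outputs (sequence, batch, width); store as (batch, sequence, width).
    captured["tokens"] = output.permute(1, 0, 2)

handle = model.visual.transformer.register_forward_hook(hook)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical image path
text = clip.tokenize(["a dog on the grass"]).to(device)                # hypothetical phrase

with torch.no_grad():
    _ = model.encode_image(image)        # runs the visual tower and triggers the hook
    text_feat = model.encode_text(text)  # (1, embed_dim)

    tokens = captured["tokens"][:, 1:, :]                       # drop the CLS token
    # Apply the same post-norm and projection CLIP uses for the CLS token,
    # so patch tokens land in the joint image-text embedding space.
    tokens = model.visual.ln_post(tokens) @ model.visual.proj   # (1, n_patches, embed_dim)

    tokens = F.normalize(tokens, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    sim = (tokens @ text_feat.T).squeeze(-1)                    # (1, n_patches) cosine scores

    grid = int(sim.shape[1] ** 0.5)                             # ViT-B/32 at 224px -> 7x7 grid
    heatmap = sim.reshape(1, 1, grid, grid)
    heatmap = F.interpolate(heatmap, size=(224, 224),
                            mode="bilinear", align_corners=False)

handle.remove()
print(heatmap.shape)  # torch.Size([1, 1, 224, 224])
```

The key step mirrors the abstract's point about preserving semantic properties: the patch tokens are passed through the same ln_post and projection that CLIP applies to its image-level embedding, so they can be compared directly with the phrase embedding; the paper's adaptation goes further to produce high-resolution maps from both ViT and ResNet backbones.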