Paper Title
ComCLIP: Training-Free Compositional Image and Text Matching
Paper Authors
Paper Abstract
Contrastive Language-Image Pretraining (CLIP) has demonstrated great zero-shot performance for matching images and text. However, it remains challenging to adapt vision-language pretrained models like CLIP to compositional image and text matching -- a more challenging image and text matching task that requires the model to understand compositional word concepts and visual components. Towards better compositional generalization in zero-shot image and text matching, in this paper we study the problem from a causal perspective: the erroneous semantics of individual entities are essentially confounders that cause the matching failure. Therefore, we propose a novel \textbf{\textit{training-free}} compositional CLIP model (ComCLIP). ComCLIP disentangles input images into subject, object, and action sub-images and composes CLIP's vision encoder and text encoder to perform evolving matching over compositional text embeddings and sub-image embeddings. In this way, ComCLIP can mitigate spurious correlations introduced by the pretrained CLIP models and dynamically evaluate the importance of each component. Experiments on four compositional image-text matching datasets (SVO, ComVG, Winoground, and VL-CheckList) and two general image-text retrieval datasets (Flickr30K and MSCOCO) demonstrate the effectiveness of our plug-and-play method, which boosts the \textbf{\textit{zero-shot}} inference ability of CLIP, SLIP, and BLIP2 even without further training or fine-tuning. Our code can be found at https://github.com/eric-ai-lab/ComCLIP.
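To make the abstract's idea concrete, below is a minimal sketch (not the authors' official implementation, which lives in the linked repository) of training-free compositional matching with a frozen CLIP: the full image and a set of subject/object/action sub-images are each scored against a candidate caption, and the scores are blended so every visual component contributes to the final match. The Hugging Face `transformers` CLIP API is used here for convenience; the `compositional_score` function, the `alpha` weighting, and the assumption that sub-image crops come from an off-the-shelf detector are illustrative simplifications of the paper's evolving-matching procedure.

```python
# Illustrative sketch of training-free compositional image-text matching with frozen CLIP.
# The sub-image extraction and the simple score blending below are assumptions for
# demonstration; they are not ComCLIP's exact evolving-matching mechanism.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()


def clip_score(image: Image.Image, text: str) -> torch.Tensor:
    """Cosine similarity between one image and one caption under frozen CLIP."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1)


def compositional_score(full_image, sub_images, caption, alpha=0.5):
    """Blend the global image-caption score with the mean score of the
    (hypothetical) subject / object / action sub-images."""
    global_score = clip_score(full_image, caption)
    if not sub_images:
        return global_score
    sub_scores = torch.stack([clip_score(s, caption) for s in sub_images])
    return alpha * global_score + (1 - alpha) * sub_scores.mean()


# Usage: pick the caption with the higher compositional score, e.g. for a Winoground-style pair.
# image = Image.open("example.jpg")
# subs = [subject_crop, object_crop]  # hypothetical crops from an off-the-shelf detector
# best = max(["a dog chasing a cat", "a cat chasing a dog"],
#            key=lambda c: compositional_score(image, subs, c).item())
```

Because both encoders stay frozen and only inference-time scores are combined, this kind of procedure is plug-and-play in the sense the abstract describes: it can wrap any CLIP-like dual encoder without further training or fine-tuning.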