Paper Title
ComCLIP: Training-Free Compositional Image and Text Matching
Paper Authors
Paper Abstract
Contrastive Language-Image Pretraining (CLIP) has demonstrated great zero-shot performance for matching images and text. However, it remains challenging to adapt vision-language pretrained models like CLIP to compositional image and text matching -- a more challenging image and text matching task that requires the model to understand compositional word concepts and visual components. Towards better compositional generalization in zero-shot image and text matching, in this paper we study the problem from a causal perspective: the erroneous semantics of individual entities are essentially confounders that cause the matching failure. Therefore, we propose a novel \textbf{\textit{training-free}} compositional CLIP model (ComCLIP). ComCLIP disentangles input images into subject, object, and action sub-images and composes CLIP's vision encoder and text encoder to perform evolving matching over compositional text embeddings and sub-image embeddings. In this way, ComCLIP can mitigate spurious correlations introduced by the pretrained CLIP models and dynamically evaluate the importance of each component. Experiments on four compositional image-text matching datasets (SVO, ComVG, Winoground, and VL-CheckList) and two general image-text retrieval datasets (Flickr30K and MSCOCO) demonstrate the effectiveness of our plug-and-play method, which boosts the \textbf{\textit{zero-shot}} inference ability of CLIP, SLIP, and BLIP2 even without further training or fine-tuning. Our code can be found at https://github.com/eric-ai-lab/ComCLIP.
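To make the abstract's idea concrete, below is a minimal sketch (not the authors' official implementation, which lives in the linked repository) of training-free compositional matching with a frozen CLIP: the full image and a set of subject/object/action sub-images are each scored against a candidate caption, and the scores are blended so every visual component contributes to the final match. The Hugging Face `transformers` CLIP API is used here for convenience; the `compositional_score` function, the `alpha` weighting, and the assumption that sub-image crops come from an off-the-shelf detector are illustrative simplifications of the paper's evolving-matching procedure.

```python
# Illustrative sketch of training-free compositional image-text matching with frozen CLIP.
# The sub-image extraction and the simple score blending below are assumptions for
# demonstration; they are not ComCLIP's exact evolving-matching mechanism.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()


def clip_score(image: Image.Image, text: str) -> torch.Tensor:
    """Cosine similarity between one image and one caption under frozen CLIP."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1)


def compositional_score(full_image, sub_images, caption, alpha=0.5):
    """Blend the global image-caption score with the mean score of the
    (hypothetical) subject / object / action sub-images."""
    global_score = clip_score(full_image, caption)
    if not sub_images:
        return global_score
    sub_scores = torch.stack([clip_score(s, caption) for s in sub_images])
    return alpha * global_score + (1 - alpha) * sub_scores.mean()


# Usage: pick the caption with the higher compositional score, e.g. for a Winoground-style pair.
# image = Image.open("example.jpg")
# subs = [subject_crop, object_crop]  # hypothetical crops from an off-the-shelf detector
# best = max(["a dog chasing a cat", "a cat chasing a dog"],
#            key=lambda c: compositional_score(image, subs, c).item())
```

Because both encoders stay frozen and only inference-time scores are combined, this kind of procedure is plug-and-play in the sense the abstract describes: it can wrap any CLIP-like dual encoder without further training or fine-tuning.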