Paper Title
CyCLIP: Cyclic Contrastive Language-Image Pretraining
Paper Authors
Paper Abstract
Recent advances in contrastive representation learning over paired image-text data have led to models such as CLIP that achieve state-of-the-art performance for zero-shot classification and distributional robustness. Such models typically require joint reasoning in the image and text representation spaces for downstream inference tasks. Contrary to prior beliefs, we demonstrate that the image and text representations learned via a standard contrastive objective are not interchangeable and can lead to inconsistent downstream predictions. To mitigate this issue, we formalize consistency and propose CyCLIP, a framework for contrastive representation learning that explicitly optimizes for the learned representations to be geometrically consistent in the image and text space. In particular, we show that consistent representations can be learned by explicitly symmetrizing (a) the similarity between the two mismatched image-text pairs (cross-modal consistency); and (b) the similarity between the image-image pair and the text-text pair (in-modal consistency). Empirically, we show that the improved consistency in CyCLIP translates to significant gains over CLIP, with gains ranging from 10%-24% for zero-shot classification accuracy on standard benchmarks (CIFAR-10, CIFAR-100, ImageNet1K) and 10%-27% for robustness to various natural distribution shifts. The code is available at https://github.com/goel-shashank/CyCLIP.
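To make the two consistency objectives in the abstract concrete, below is a minimal PyTorch sketch of the regularizers. This is not the authors' released code (the official implementation is in the linked repository); the function name `cyclip_regularizers` and the weights `lam_cross`/`lam_in` are illustrative assumptions.

```python
import torch

def cyclip_regularizers(image_emb, text_emb):
    """CyCLIP consistency terms for a batch of N paired embeddings of shape (N, d).
    Both inputs are assumed L2-normalized, so dot products are cosine similarities."""
    sim = image_emb @ text_emb.t()                      # sim[j, k] = <I_j, T_k>
    # (a) Cross-modal consistency: the two mismatched pairs should be
    #     equally similar, i.e. <I_j, T_k> should equal <I_k, T_j>.
    cross_modal = (sim - sim.t()).pow(2).mean()
    # (b) In-modal consistency: image-image similarity should match the
    #     corresponding text-text similarity, <I_j, I_k> vs. <T_j, T_k>.
    in_modal = (image_emb @ image_emb.t()
                - text_emb @ text_emb.t()).pow(2).mean()
    return cross_modal, in_modal

# Toy usage with random normalized embeddings:
N, d = 8, 512
img = torch.nn.functional.normalize(torch.randn(N, d), dim=-1)
txt = torch.nn.functional.normalize(torch.randn(N, d), dim=-1)
cross, inmod = cyclip_regularizers(img, txt)
lam_cross, lam_in = 0.25, 0.25   # assumed weights; see the paper/repo for actual settings
# total_loss = clip_infonce_loss + lam_cross * cross + lam_in * inmod
```

Squared differences are a natural symmetrization penalty here: both terms vanish exactly when the similarity geometry is identical across the image and text spaces, which is the consistency property the abstract formalizes.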