Paper Title
Exploring Visual Interpretability for Contrastive Language-Image Pre-training
Authors
Abstract
Contrastive Language-Image Pre-training (CLIP) learns rich representations via readily available natural language supervision. It improves the performance of downstream vision tasks, including but not limited to zero-shot classification, long-tail recognition, segmentation, retrieval, captioning, and video tasks. However, the visual explainability of CLIP is rarely studied, especially for the raw feature map. To provide visual explanations of its predictions, we propose the Image-Text Similarity Map (ITSM). Based on it, we surprisingly find that CLIP prefers background regions over foregrounds and shows visualization results that contradict human understanding. This phenomenon holds for both vision transformers and convolutional networks, which suggests the problem is inherent rather than specific to a particular network architecture. Experimentally, we find the devil is in the pooling part, where inappropriate pooling methods lead to a phenomenon called semantic shift. To address this problem, we propose Explainable Contrastive Language-Image Pre-training (ECLIP), which corrects the explainability via Masked Max Pooling. Specifically, to avoid the semantic shift, we replace the original attention pooling with max pooling to focus on confident foreground regions, guided by free attention during training. Experiments on three datasets suggest that ECLIP greatly improves the explainability of CLIP and outperforms previous explainability methods by large margins. The code will be released later.
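The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the shapes, the random features, and the thresholded foreground mask are all assumptions for illustration. The ITSM is computed as the cosine similarity between each spatial patch feature and a text embedding, reshaped into a map; masked max pooling then takes an element-wise max over the patch features selected by a foreground mask, instead of an attention-weighted average over all patches.

```python
import numpy as np

# Hypothetical shapes: a 7x7 CLIP-style patch feature map, 512-d embeddings.
np.random.seed(0)
H, W, D = 7, 7, 512
patch_feats = np.random.randn(H * W, D)  # per-patch visual features
text_emb = np.random.randn(D)            # embedding of one class prompt

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def itsm(patch_feats, text_emb, h, w):
    """Image-Text Similarity Map: cosine similarity between every
    spatial patch feature and the text embedding, reshaped to (h, w)."""
    sims = l2_normalize(patch_feats) @ l2_normalize(text_emb)
    return sims.reshape(h, w)

def masked_max_pool(patch_feats, mask):
    """Element-wise max over the patch features selected by a boolean
    foreground mask, so the pooled embedding is driven by confident
    foreground patches rather than an average over the whole map."""
    return patch_feats[mask].max(axis=0)

sim_map = itsm(patch_feats, text_emb, H, W)
# Crude illustrative foreground mask: patches above the mean similarity.
mask = sim_map.reshape(-1) > sim_map.mean()
pooled = masked_max_pool(patch_feats, mask)
print(sim_map.shape, pooled.shape)  # (7, 7) (512,)
```

In the paper's setting the mask would come from the free attention available during training rather than from a similarity threshold, but the pooling mechanics are the same.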