Paper Title

Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning

Paper Authors

Shibani Santurkar, Yann Dubois, Rohan Taori, Percy Liang, Tatsunori Hashimoto

Paper Abstract

The development of CLIP [Radford et al., 2021] has sparked a debate on whether language supervision can result in vision models with more transferable representations than traditional image-only methods. Our work studies this question through a carefully controlled comparison of two approaches in terms of their ability to learn representations that generalize to downstream classification tasks. We find that when the pre-training dataset meets certain criteria -- it is sufficiently large and contains descriptive captions with low variability -- image-only methods do not match CLIP's transfer performance, even when they are trained with more image data. However, contrary to what one might expect, there are practical settings in which these criteria are not met, wherein added supervision through captions is actually detrimental. Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.
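
For context, the "language supervision" discussed above refers to CLIP's contrastive objective over paired images and captions [Radford et al., 2021], in which matched image-caption pairs are pulled together and mismatched pairs pushed apart within each batch. Below is a minimal PyTorch sketch of that symmetric InfoNCE loss; the feature dimension, batch size, and temperature value are illustrative assumptions and not details taken from this paper.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of (image, caption) pairs.
    Matched pairs lie on the diagonal of the similarity matrix; every
    other pair in the batch serves as a negative."""
    # L2-normalize so dot products become cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_features @ text_features.t() / temperature

    # The i-th image matches the i-th caption.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy losses.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example with random features standing in for encoder outputs.
imgs = torch.randn(8, 512)  # image encoder outputs (hypothetical dims)
txts = torch.randn(8, 512)  # text encoder outputs (hypothetical dims)
print(clip_contrastive_loss(imgs, txts))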
