Paper Title

PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining

Paper Authors

Yuting Gao, Jinfeng Liu, Zihan Xu, Jun Zhang, Ke Li, Rongrong Ji, Chunhua Shen

Paper Abstract

Large-scale vision-language pre-training has achieved promising results on downstream tasks. Existing methods rely heavily on the assumption that the image-text pairs crawled from the Internet are in perfect one-to-one correspondence. However, in real scenarios this assumption is hard to satisfy: the text description, obtained by crawling the affiliated metadata of the image, often suffers from semantic mismatch and mutual compatibility issues. To address these issues, we introduce PyramidCLIP, which constructs an input pyramid with different semantic levels for each modality and aligns visual and linguistic elements hierarchically via peer-level semantics alignment and cross-level relation alignment. Furthermore, we soften the loss term for negative samples (unpaired samples) so as to weaken the strict constraint during the pre-training stage, thus mitigating the risk of forcing the model to distinguish compatible negative pairs. Experiments on five downstream tasks demonstrate the effectiveness of the proposed PyramidCLIP. In particular, with the same amount of 15 million pre-training image-text pairs, PyramidCLIP exceeds CLIP on ImageNet zero-shot classification top-1 accuracy by 10.6%/13.2%/10.0% with ResNet50/ViT-B32/ViT-B16 based image encoders, respectively. When scaling to larger datasets, PyramidCLIP achieves state-of-the-art results on several downstream tasks. In particular, the results of PyramidCLIP-ResNet50 trained on 143M image-text pairs surpass those of CLIP trained on 400M data on the ImageNet zero-shot classification task, significantly improving the data efficiency of CLIP.
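As a rough illustration of the softened negative-sample loss mentioned in the abstract, the PyTorch sketch below applies label smoothing to the targets of a CLIP-style symmetric contrastive loss, so unpaired samples are no longer pushed apart with hard zero/one targets. The function name `softened_clip_loss`, the smoothing scheme, and the hyperparameter values are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def softened_clip_loss(image_emb, text_emb, temperature=0.07, smoothing=0.1):
    """CLIP-style symmetric contrastive loss with softened negative targets.

    Illustrative sketch only: instead of hard one-hot targets, a small
    probability mass `smoothing` is spread over the negative pairs, so the
    model is not forced to fully separate image-text pairs that may still
    be semantically compatible.
    """
    # Normalize embeddings and compute pairwise similarity logits.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # shape (N, N)

    n = logits.size(0)
    # Softened targets: 1 - smoothing on the diagonal (matched pairs),
    # smoothing / (n - 1) spread over the off-diagonal (unmatched pairs).
    targets = torch.full((n, n), smoothing / max(n - 1, 1), device=logits.device)
    targets.fill_diagonal_(1.0 - smoothing)

    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    loss_i2t = torch.sum(-targets * F.log_softmax(logits, dim=1), dim=1).mean()
    loss_t2i = torch.sum(-targets * F.log_softmax(logits.t(), dim=1), dim=1).mean()
    return (loss_i2t + loss_t2i) / 2
```

With `smoothing=0` this reduces to the standard CLIP InfoNCE objective; a small positive value relaxes the constraint on negative pairs, which is the effect the paper describes as mitigating compatible negatives.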
