Paper Title
A simple, efficient and scalable contrastive masked autoencoder for learning visual representations
Paper Authors
Paper Abstract
We introduce CAN, a simple, efficient and scalable method for self-supervised learning of visual representations. Our framework is a minimal and conceptually clean synthesis of (C) contrastive learning, (A) masked autoencoders, and (N) the noise prediction approach used in diffusion models. The learning mechanisms are complementary to one another: contrastive learning shapes the embedding space across a batch of image samples; masked autoencoders focus on reconstruction of the low-frequency spatial correlations in a single image sample; and noise prediction encourages the reconstruction of the high-frequency components of an image. The combined approach results in a robust, scalable and simple-to-implement algorithm. The training process is symmetric, with 50% of patches in both views being masked at random, yielding a considerable efficiency improvement over prior contrastive learning methods. Extensive empirical studies demonstrate that CAN achieves strong downstream performance under both linear and finetuning evaluations on transfer learning and robustness tasks. CAN outperforms MAE and SimCLR when pre-training on ImageNet, but is especially useful for pre-training on larger uncurated datasets such as JFT-300M: for linear probe on ImageNet, CAN achieves 75.4% compared to 73.4% for SimCLR and 64.1% for MAE. The finetuned performance on ImageNet of our ViT-L model is 86.1%, compared to 85.5% for SimCLR, and 85.4% for MAE. The overall FLOPs load of SimCLR is 70% higher than CAN for ViT-L models.
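To make the combined objective concrete, below is a minimal, illustrative sketch of a CAN-style training step as described in the abstract: two augmented views each have 50% of their patches masked at random, Gaussian noise is added, and the loss sums (C) a contrastive InfoNCE term over pooled embeddings, (A) a masked-patch reconstruction term, and (N) a noise-prediction term. The linear "encoder"/"decoder" stand-ins, the function names (patchify, per_view_losses, can_loss), and hyperparameters (sigma, temperature, dimensions) are assumptions for illustration only, not the authors' reference implementation, which uses ViT backbones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH = 16
DIM = PATCH * PATCH * 3   # flattened patch dimension
HID = 128                 # latent dimension (illustrative)

# Toy linear stand-ins for the ViT encoder/decoder used in the paper.
encoder = nn.Linear(DIM, HID)
decoder = nn.Linear(HID, DIM)      # reconstructs masked patch pixels
noise_head = nn.Linear(HID, DIM)   # predicts the noise added to visible patches
proj_head = nn.Linear(HID, 64)     # contrastive projection head
mask_token = nn.Parameter(torch.zeros(1, 1, HID))

def patchify(images):
    """(B, C, H, W) -> (B, N, DIM) non-overlapping patches."""
    B, C, H, W = images.shape
    x = images.unfold(2, PATCH, PATCH).unfold(3, PATCH, PATCH)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, DIM)

def per_view_losses(view, mask_ratio=0.5, sigma=0.05):
    """Mask 50% of patches, add noise; return (A)+(N) loss and a pooled embedding."""
    patches = patchify(view)
    B, N, D = patches.shape
    noise = sigma * torch.randn_like(patches)
    noisy = patches + noise

    # Random 50% mask per image.
    keep = int(N * (1 - mask_ratio))
    ids = torch.rand(B, N).argsort(dim=1)
    vis_ids, msk_ids = ids[:, :keep], ids[:, keep:]
    gather = lambda t, idx: torch.gather(t, 1, idx.unsqueeze(-1).expand(-1, -1, t.size(-1)))

    latent = encoder(gather(noisy, vis_ids))           # encode visible patches only
    # (A) reconstruct masked patches (toy decoder: mask tokens + pooled latent).
    recon = decoder(mask_token.expand(B, N - keep, -1) + latent.mean(1, keepdim=True))
    loss_recon = F.mse_loss(recon, gather(patches, msk_ids))
    # (N) predict the noise that was added to the visible patches.
    loss_noise = F.mse_loss(noise_head(latent), gather(noise, vis_ids))

    embed = F.normalize(proj_head(latent.mean(dim=1)), dim=-1)   # pooled embedding for (C)
    return loss_recon + loss_noise, embed

def can_loss(view1, view2, temperature=0.1):
    """Symmetric CAN-style objective: (C) InfoNCE across the batch + (A)/(N) per view."""
    l1, z1 = per_view_losses(view1)
    l2, z2 = per_view_losses(view2)
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(logits.size(0))
    loss_contrast = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
    return loss_contrast + l1 + l2

# Example: two augmented views of a small batch.
x1, x2 = torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224)
loss = can_loss(x1, x2)
loss.backward()
```

Because both views are masked symmetrically and the encoder only processes the visible 50% of patches, each forward pass touches roughly half the tokens, which is the source of the efficiency gain over contrastive methods that encode full images.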