Paper Title

MimCo: Masked Image Modeling Pre-training with Contrastive Teacher

Paper Authors

Zhou, Qiang; Yu, Chaohui; Luo, Hao; Wang, Zhibin; Li, Hao

Paper Abstract

Recent masked image modeling (MIM) has received much attention in self-supervised learning (SSL); it requires the target model to recover the masked part of the input image. Although MIM-based pre-training methods achieve new state-of-the-art performance when transferred to many downstream tasks, visualizations show that the learned representations are less separable, especially compared to those obtained with contrastive learning pre-training. This motivates us to ask whether the linear separability of MIM pre-trained representations can be further improved, thereby improving pre-training performance. Since MIM and contrastive learning tend to use different data augmentations and training strategies, combining these two pretext tasks is not trivial. In this work, we propose a novel and flexible pre-training framework, named MimCo, which combines MIM and contrastive learning through two-stage pre-training. Specifically, MimCo takes a pre-trained contrastive learning model as the teacher model and is pre-trained with two types of learning targets: patch-level and image-level reconstruction losses. Extensive transfer experiments on downstream tasks demonstrate the superior performance of our MimCo pre-training framework. Taking ViT-S as an example, when using the pre-trained MoCov3-ViT-S as the teacher model, MimCo needs only 100 epochs of pre-training to achieve 82.53% top-1 fine-tuning accuracy on ImageNet-1K, outperforming state-of-the-art self-supervised learning counterparts.
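
The abstract only outlines the framework, so below is a minimal PyTorch sketch of the two-loss idea, assuming a frozen contrastive teacher (e.g., a MoCov3-pre-trained ViT-S) that exposes both patch tokens and a global token, simple linear projection heads, and MSE / cosine reconstruction objectives. The module names, head designs, exact loss forms, and masking interface here are illustrative assumptions, not MimCo's actual implementation.

```python
# Minimal sketch of a masked student distilling from a frozen contrastive teacher.
# Hypothetical interfaces: both encoders return (patch_tokens, global_token);
# the real MimCo heads, loss forms, and masking strategy may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MimCoSketch(nn.Module):
    def __init__(self, student: nn.Module, teacher: nn.Module, dim: int = 384):
        super().__init__()
        self.student = student            # ViT backbone being pre-trained
        self.teacher = teacher            # frozen contrastive pre-trained ViT (e.g. MoCov3-ViT-S)
        for p in self.teacher.parameters():
            p.requires_grad = False
        # Lightweight projection heads into the teacher's feature space (assumed, not from the paper).
        self.patch_head = nn.Linear(dim, dim)
        self.image_head = nn.Linear(dim, dim)

    def forward(self, images: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        """images: (B, 3, H, W); mask: (B, N) boolean, True where a patch is masked."""
        with torch.no_grad():
            t_patches, t_global = self.teacher(images)          # (B, N, D), (B, D)
        s_patches, s_global = self.student(images, mask=mask)   # student only sees the masked image

        # Patch-level reconstruction: regress teacher patch features at masked positions.
        patch_loss = F.mse_loss(self.patch_head(s_patches)[mask], t_patches[mask])

        # Image-level reconstruction: align the student's global token with the teacher's.
        image_loss = 1.0 - F.cosine_similarity(self.image_head(s_global), t_global, dim=-1).mean()

        return patch_loss + image_loss
```

Because the student regresses teacher features on a masked input rather than raw pixels, the teacher slot in such a sketch could in principle take any contrastive pre-trained backbone, which is consistent with the flexibility the abstract claims.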
