Paper Title
Adapting Self-Supervised Vision Transformers by Probing Attention-Conditioned Masking Consistency
Paper Authors
Paper Abstract
Visual domain adaptation (DA) seeks to transfer trained models to unseen, unlabeled domains across distribution shift, but approaches typically focus on adapting convolutional neural network architectures initialized with supervised ImageNet representations. In this work, we shift focus to adapting modern architectures for object recognition -- the increasingly popular Vision Transformer (ViT) -- and modern pretraining based on self-supervised learning (SSL). Inspired by the design of recent SSL approaches based on learning from partial image inputs generated via masking or cropping -- either by learning to predict the missing pixels, or learning representational invariances to such augmentations -- we propose PACMAC, a simple two-stage adaptation algorithm for self-supervised ViTs. PACMAC first performs in-domain SSL on pooled source and target data to learn task-discriminative features, and then probes the model's predictive consistency across a set of partial target inputs generated via a novel attention-conditioned masking strategy, to identify reliable candidates for self-training. Our simple approach leads to consistent performance gains over competing methods that use ViTs and self-supervised initializations on standard object recognition benchmarks. Code available at https://github.com/virajprabhu/PACMAC
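The self-training stage described above probes predictive consistency across partial target inputs produced by attention-conditioned masking. The following is a toy sketch of that idea, not the paper's actual implementation: it assumes a precomputed per-patch attention score, an illustrative round-robin assignment of the most-attended patches into disjoint masks, and a caller-supplied stub classifier. The helper names `attention_conditioned_masks` and `is_reliable_candidate` are hypothetical.

```python
import numpy as np

def attention_conditioned_masks(attn, num_views=2, mask_frac=0.25):
    """Build disjoint patch masks conditioned on attention (illustrative only).

    attn: 1-D array of per-patch attention scores (hypothetical input).
    Returns `num_views` boolean masks; True marks a patch that is masked out.
    The most-attended patches are dealt round-robin across the views so each
    view hides a different, highly-attended part of the object.
    """
    n_patches = attn.shape[0]
    n_masked = int(n_patches * mask_frac)
    ranked = np.argsort(-attn)[: n_masked * num_views]  # top-attended patches
    masks = []
    for v in range(num_views):
        m = np.zeros(n_patches, dtype=bool)
        m[ranked[v::num_views]] = True
        masks.append(m)
    return masks

def is_reliable_candidate(predict, patches, attn, num_views=2, mask_frac=0.25):
    """Flag a target sample for self-training if the predicted class
    agrees across all attention-conditioned partial views (toy criterion).

    predict: stub classifier mapping a patch array to class logits.
    patches: (n_patches, dim) array of patch features for one image.
    """
    preds = []
    for mask in attention_conditioned_masks(attn, num_views, mask_frac):
        view = patches.copy()
        view[mask] = 0.0  # zero out the masked patches
        preds.append(int(np.argmax(predict(view))))
    return len(set(preds)) == 1  # consistent prediction across views
```

With 16 patches, two views, and a 25% mask fraction, each view hides 4 distinct high-attention patches; a sample passes only if the classifier's argmax is unchanged under every partial view.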