Paper Title

Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality

Paper Authors

Xiang Li, Wenhai Wang, Lingfeng Yang, Jian Yang

Paper Abstract

Masked AutoEncoder (MAE) has recently led the trend in visual self-supervision with an elegant asymmetric encoder-decoder design, which significantly improves both pre-training efficiency and fine-tuning accuracy. Notably, the success of the asymmetric structure relies on the "global" property of the Vanilla Vision Transformer (ViT), whose self-attention mechanism reasons over arbitrary subsets of discrete image patches. However, it is still unclear how the advanced Pyramid-based ViTs (e.g., PVT, Swin) can be adopted in MAE pre-training, as they commonly introduce operators within "local" windows, making it difficult to handle a random sequence of partial vision tokens. In this paper, we propose Uniform Masking (UM), successfully enabling MAE pre-training for Pyramid-based ViTs with locality (termed "UM-MAE" for short). Specifically, UM includes a Uniform Sampling (US) step that strictly samples $1$ random patch from each $2 \times 2$ grid, and a Secondary Masking (SM) step which randomly masks a portion (usually $25\%$) of the already sampled regions as learnable tokens. US preserves an equal number of elements across multiple non-overlapping local windows, resulting in smooth support for popular Pyramid-based ViTs, whilst SM is designed for better transferable visual representations, since US reduces the difficulty of the pixel-recovery pretext task that would otherwise hinder semantic learning. We demonstrate that UM-MAE significantly improves the pre-training efficiency of Pyramid-based ViTs (e.g., it speeds up training and reduces GPU memory by $\sim 2\times$) while maintaining competitive fine-tuning performance across downstream tasks. For example, using the HTC++ detector, a Swin-Large backbone self-supervised with UM-MAE on ImageNet-1K alone can even outperform one supervised on ImageNet-22K. The codes are available at https://github.com/implus/UM-MAE.
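As a concrete illustration of the masking strategy described in the abstract, below is a minimal NumPy sketch of how the Uniform Sampling and Secondary Masking maps could be generated on a patch grid. The function name `uniform_masking` and its arguments are illustrative assumptions, not taken from the official UM-MAE codebase, and the sketch covers only mask generation, not the later reorganization of sampled patches into a compact, 2x-smaller input for the pyramid encoder.

```python
import numpy as np

def uniform_masking(grid_h=14, grid_w=14, secondary_ratio=0.25, rng=None):
    """Sketch (not the official implementation) of Uniform Masking (UM).

    Returns two boolean maps over the patch grid:
      visible  -- patches fed to the encoder as real image patches
      sm_token -- sampled patches re-masked by Secondary Masking and
                  represented by a learnable mask token
    Patches in neither map are dropped entirely, as in MAE's asymmetric encoder.
    """
    rng = np.random.default_rng() if rng is None else rng
    assert grid_h % 2 == 0 and grid_w % 2 == 0

    sampled = np.zeros((grid_h, grid_w), dtype=bool)
    # Uniform Sampling (US): exactly one random patch from every 2x2 cell,
    # so every non-overlapping local window keeps the same number of tokens.
    for i in range(0, grid_h, 2):
        for j in range(0, grid_w, 2):
            di = rng.integers(0, 2)
            dj = rng.integers(0, 2)
            sampled[i + di, j + dj] = True

    # Secondary Masking (SM): re-mask ~25% of the sampled patches; these
    # positions stay in the encoder input but as learnable mask tokens.
    coords = np.argwhere(sampled)
    n_sm = int(round(secondary_ratio * len(coords)))
    picked = coords[rng.choice(len(coords), size=n_sm, replace=False)]
    sm_token = np.zeros_like(sampled)
    sm_token[picked[:, 0], picked[:, 1]] = True
    visible = sampled & ~sm_token
    return visible, sm_token

if __name__ == "__main__":
    visible, sm_token = uniform_masking(14, 14)
    # A 14x14 grid has 49 2x2 cells: 49 sampled patches, of which
    # about 12 become SM tokens and about 37 stay visible.
    print(visible.sum(), sm_token.sum())
```

Because US always keeps one patch per 2x2 cell, every window of the pyramid encoder sees the same token count, which is what allows window-based operators (e.g., in PVT or Swin) to run unchanged on the sparsified input.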
