Stare at What You See: Masked Image Modeling without Reconstruction

Paper Title

Stare at What You See: Masked Image Modeling without Reconstruction

Authors

Hongwei Xue, Peng Gao, Hongyang Li, Yu Qiao, Hao Sun, Houqiang Li, Jiebo Luo

Abstract

Masked Autoencoders (MAE) have become a prevailing paradigm for large-scale vision representation pre-training. By reconstructing masked image patches from a small portion of visible image regions, MAE forces the model to infer semantic correlation within an image. Recently, some approaches have applied semantic-rich teacher models to extract image features as the reconstruction target, leading to better performance. However, unlike low-level features such as pixel values, we argue that the features extracted by powerful teacher models already encode rich semantic correlation across regions in an intact image. This raises one question: is reconstruction necessary in Masked Image Modeling (MIM) with a teacher model? In this paper, we propose an efficient MIM paradigm named MaskAlign. MaskAlign simply learns the consistency between visible patch features extracted by the student model and intact image features extracted by the teacher model. To further advance the performance and tackle the problem of input inconsistency between the student and teacher models, we propose a Dynamic Alignment (DA) module to apply learnable alignment. Our experimental results demonstrate that masked modeling does not lose effectiveness even without reconstruction on masked regions. Combined with Dynamic Alignment, MaskAlign can achieve state-of-the-art performance with much higher efficiency. Code and models will be available at https://github.com/OpenPerceptionX/maskalign.
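
The alignment idea in the abstract can be illustrated with a toy, standard-library-only sketch. It is not the authors' implementation. Everything specific here is an assumption made for illustration: the function names, the 1 - cosine similarity loss, and the weighted sum over teacher feature levels used as a stand-in for learnable alignment (the abstract does not say how the Dynamic Alignment module works internally). The point it shows is that the student only ever sees visible patches, the teacher sees the intact image, and the loss compares features at the visible positions without reconstructing any masked patch.

```python
import math
import random


def random_visible_indices(num_patches, mask_ratio, rng):
    """Pick the patch indices the student is allowed to see."""
    num_visible = max(1, round(num_patches * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_patches), num_visible))


def l2_normalize(vec):
    """Scale a feature vector to unit length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def combine_teacher_levels(teacher_levels, level_weights):
    """Hypothetical stand-in for learnable alignment: a weighted sum of
    per-level teacher features. In training the weights would be learned;
    here they are plain numbers passed in by the caller."""
    num_patches = len(teacher_levels[0])
    dim = len(teacher_levels[0][0])
    combined = [[0.0] * dim for _ in range(num_patches)]
    for weight, level in zip(level_weights, teacher_levels):
        for p in range(num_patches):
            for d in range(dim):
                combined[p][d] += weight * level[p][d]
    return combined


def alignment_loss(student_visible, teacher_full, visible_idx):
    """Mean (1 - cosine similarity) between student features of the visible
    patches and teacher features at the same positions. The cosine loss is
    an assumed choice; masked positions are never reconstructed."""
    total = 0.0
    for feat, idx in zip(student_visible, visible_idx):
        s = l2_normalize(feat)
        t = l2_normalize(teacher_full[idx])
        total += 1.0 - sum(a * b for a, b in zip(s, t))
    return total / len(visible_idx)
```

In this sketch the student's cost scales with the number of visible patches only, which is one plausible reading of the efficiency claim; a real setup would use trained ViT encoders rather than hand-built feature lists.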
