Paper Title
Siamese Image Modeling for Self-Supervised Vision Representation Learning
Paper Authors
Paper Abstract
Self-supervised learning (SSL) has delivered superior performance on a variety of downstream vision tasks. Two mainstream SSL frameworks have been proposed, i.e., Instance Discrimination (ID) and Masked Image Modeling (MIM). ID pulls together representations from different views of the same image while avoiding feature collapse; however, it lacks spatial sensitivity, which requires modeling the local structure within each image. MIM, on the other hand, reconstructs the original content given a masked image; it in turn lacks good semantic alignment, which requires projecting semantically similar views into nearby representations. To address this dilemma, we observe that (1) semantic alignment can be achieved by matching different image views under strong augmentations; and (2) spatial sensitivity can benefit from predicting dense representations from masked images. Driven by these analyses, we propose Siamese Image Modeling (SiameseIM), which predicts the dense representations of an augmented view based on another masked view of the same image with different augmentations. SiameseIM uses a Siamese network with two branches. The online branch encodes the first view and predicts the second view's representations according to the relative positions between the two views. The target branch produces the targets by encoding the second view. SiameseIM surpasses both ID and MIM on a wide range of downstream tasks, including ImageNet finetuning and linear probing, COCO and LVIS detection, and ADE20K semantic segmentation. The improvements are more significant in few-shot, long-tail, and robustness-concerned scenarios. Code shall be released at https://github.com/fundamentalvision/Siamese-Image-Modeling.
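The two-branch design described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the class names (`OnlineBranch`), the linear stand-ins for the ViT encoder and predictor, the 2-dimensional relative-position input, and the cosine loss are all assumptions made for brevity; see the released code for the actual method.

```python
# Hedged sketch of one SiameseIM-style training step (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class OnlineBranch(nn.Module):
    """Encodes the masked first view, then predicts dense features of the
    second view, conditioned on the relative positions between the views."""
    def __init__(self, dim=32):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)        # stand-in for a ViT encoder
        self.predictor = nn.Linear(dim + 2, dim)  # +2 for a toy relative-position input

    def forward(self, view1_masked, rel_pos):
        h = self.encoder(view1_masked)
        # Condition the dense prediction on where view 2 lies relative to view 1.
        h = torch.cat([h, rel_pos.expand(h.size(0), h.size(1), -1)], dim=-1)
        return self.predictor(h)

@torch.no_grad()
def ema_update(target, online, m=0.996):
    # Target branch as a slow exponential moving average of the online branch,
    # a common choice in Siamese SSL (assumed here, not stated in the abstract).
    for pt, po in zip(target.parameters(), online.parameters()):
        pt.mul_(m).add_(po, alpha=1 - m)

dim, n_patches = 32, 16
online = OnlineBranch(dim)
target_encoder = nn.Linear(dim, dim)  # target branch: encodes the second view
target_encoder.load_state_dict(online.encoder.state_dict())

view1 = torch.randn(4, n_patches, dim)    # first augmented view (as patch features)
mask = (torch.rand(4, n_patches, 1) > 0.5).float()  # random patch mask
view2 = torch.randn(4, n_patches, dim)    # second, differently augmented view
rel_pos = torch.randn(1, 1, 2)            # relative position of view 2 w.r.t. view 1

pred = online(view1 * mask, rel_pos)      # dense prediction of view 2's features
with torch.no_grad():
    tgt = target_encoder(view2)           # target dense representation

# Per-patch (dense) matching loss; negative cosine similarity as an example.
loss = -F.cosine_similarity(pred, tgt, dim=-1).mean()
loss.backward()
ema_update(target_encoder, online.encoder)
```

The key difference from plain MIM is that the prediction target comes from a differently augmented view encoded by the target branch, rather than from the raw pixels of the same view, which is what gives the method its semantic alignment on top of spatial sensitivity.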