Paper Title


In-N-Out Generative Learning for Dense Unsupervised Video Segmentation

Paper Authors

Xiao Pan, Peike Li, Zongxin Yang, Huiling Zhou, Chang Zhou, Hongxia Yang, Jingren Zhou, Yi Yang

Paper Abstract


In this paper, we focus on unsupervised learning for Video Object Segmentation (VOS), which learns visual correspondence (i.e., the similarity between pixel-level features) from unlabeled videos. Previous methods are mainly based on the contrastive learning paradigm, which optimizes either at the image level or the pixel level. Image-level optimization (e.g., the spatially pooled feature of ResNet) learns robust high-level semantics but is sub-optimal since the pixel-level features are only optimized implicitly. By contrast, pixel-level optimization is more explicit; however, it is sensitive to the visual quality of the training data and is not robust to object deformation. To complementarily perform these two levels of optimization in a unified framework, we propose In-aNd-Out (INO) generative learning from a purely generative perspective, with the help of the naturally designed class tokens and patch tokens in the Vision Transformer (ViT). Specifically, for image-level optimization, we force out-view imagination from local to global views on the class tokens, which helps capture high-level semantics; we name this out-generative learning. For pixel-level optimization, we perform in-view masked image modeling on the patch tokens, which recovers the corrupted parts of an image by inferring its fine-grained structure; we term this in-generative learning. To better exploit temporal information, we additionally enforce inter-frame consistency at both the feature and affinity-matrix levels. Extensive experiments on DAVIS-2017 val and YouTube-VOS 2018 val show that our INO outperforms previous state-of-the-art methods by significant margins. Code is available at: https://github.com/pansanity666/INO_VOS
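To make the correspondence idea in the abstract concrete, the sketch below computes an affinity matrix between patch-token features of two frames and a simple inter-frame consistency term at both the feature and affinity-matrix levels. This is only a minimal illustration under assumptions: a generic ViT-style encoder producing (B, N, C) patch tokens, a cosine-similarity affinity with a softmax temperature, and toy loss forms. It is not the paper's exact formulation; see the linked repository for the official implementation.

```python
# Minimal sketch, assuming generic ViT patch tokens and toy loss forms
# (NOT the paper's exact losses -- see the official INO_VOS repository).
import torch
import torch.nn.functional as F

def patch_affinity(feat_a, feat_b, temperature=0.07):
    """Affinity between pixel-level (patch-token) features of two frames.

    feat_a, feat_b: (B, N, C) patch tokens for frame A and frame B.
    Returns a row-stochastic (B, N, N) matrix: how strongly each patch in A
    corresponds to each patch in B.
    """
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    sim = torch.einsum('bnc,bmc->bnm', a, b) / temperature
    return sim.softmax(dim=-1)

def interframe_consistency(feat_t, feat_t1):
    """Toy inter-frame consistency at feature and affinity-matrix levels."""
    # Feature level: keep temporally adjacent patch features close.
    feat_loss = (1 - F.cosine_similarity(feat_t, feat_t1, dim=-1)).mean()
    # Affinity level: forward (t -> t+1) and backward (t+1 -> t) affinities
    # should roughly be transposes of each other (cycle consistency).
    aff_fwd = patch_affinity(feat_t, feat_t1)
    aff_bwd = patch_affinity(feat_t1, feat_t)
    aff_loss = F.mse_loss(aff_fwd, aff_bwd.transpose(1, 2))
    return feat_loss + aff_loss

# Usage with random stand-in features (B=2 clips, N=196 patches, C=384 dims).
feat_t = torch.randn(2, 196, 384)
feat_t1 = torch.randn(2, 196, 384)
print(interframe_consistency(feat_t, feat_t1))
```

At inference time, an affinity matrix of this kind is typically what propagates the first-frame segmentation labels to later frames in correspondence-based VOS evaluation.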
