Title
Open-Vocabulary Universal Image Segmentation with MaskCLIP
Authors
Abstract
In this paper, we tackle an emerging computer vision task, open-vocabulary universal image segmentation, which aims to perform semantic/instance/panoptic segmentation (background semantic labeling + foreground instance segmentation) for arbitrary categories of text-based descriptions at inference time. We first build a baseline method by directly adopting pre-trained CLIP models without finetuning or distillation. We then develop MaskCLIP, a Transformer-based approach with a MaskCLIP Visual Encoder, an encoder-only module that seamlessly integrates mask tokens with a pre-trained ViT CLIP model for semantic/instance segmentation and class prediction. MaskCLIP learns to efficiently and effectively utilize pre-trained partial/dense CLIP features within the MaskCLIP Visual Encoder, avoiding the time-consuming student-teacher training process. MaskCLIP outperforms previous methods for semantic/instance/panoptic segmentation on the ADE20K and PASCAL datasets. We show qualitative results for MaskCLIP with online custom categories. Project website: https://maskclip.github.io.
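The open-vocabulary class-prediction step summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the embeddings and category names below are random placeholders, whereas in MaskCLIP the text embeddings would come from the CLIP text encoder and the mask embeddings from the MaskCLIP Visual Encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical placeholder embeddings. In the actual pipeline, one
# embedding per category name comes from the CLIP text encoder, and
# one embedding per predicted mask comes from the visual side.
categories = ["cat", "dog", "tree"]            # arbitrary text-based classes
text_emb = rng.normal(size=(len(categories), 512))
mask_emb = rng.normal(size=(4, 512))           # 4 predicted mask segments

def l2norm(x):
    """Normalize embeddings to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity between each mask embedding and each category
# embedding, scaled by a CLIP-style temperature, then a numerically
# stable softmax over categories.
logits = 100.0 * l2norm(mask_emb) @ l2norm(text_emb).T
logits -= logits.max(axis=-1, keepdims=True)
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Each mask is assigned the category with the highest probability;
# new categories can be added at inference time just by adding text.
pred = [categories[i] for i in probs.argmax(axis=-1)]
print(pred)
```

Because category embeddings are computed from text alone, extending the label set at inference time only requires encoding new category names, with no retraining of the segmentation model.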