Paper Title
SimpleClick: Interactive Image Segmentation with Simple Vision Transformers
Paper Authors
Abstract
Click-based interactive image segmentation aims to extract objects with a limited number of user clicks. A hierarchical backbone is the de facto architecture for current methods. Recently, the plain, non-hierarchical Vision Transformer (ViT) has emerged as a competitive backbone for dense prediction tasks. This design allows the original ViT to serve as a foundation model that can be finetuned for downstream tasks without redesigning a hierarchical backbone for pretraining. Although this design is simple and has been proven effective, it has not yet been explored for interactive image segmentation. To fill this gap, we propose SimpleClick, the first interactive segmentation method that leverages a plain backbone. Based on the plain backbone, we introduce a symmetric patch embedding layer that encodes clicks into the backbone with minor modifications to the backbone itself. With the plain backbone pretrained as a masked autoencoder (MAE), SimpleClick achieves state-of-the-art performance. Remarkably, our method achieves 4.15 NoC@90 on SBD, a 21.8% improvement over the previous best result. Extensive evaluation on medical images demonstrates the generalizability of our method. We further develop an extremely tiny ViT backbone for SimpleClick and provide a detailed computational analysis, highlighting its suitability as a practical annotation tool.
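The NoC@90 figure quoted in the abstract is the number of clicks needed to reach 90% IoU, averaged over test instances. The sketch below shows one way such a metric could be computed from per-click IoU values. The function names `noc` and `mean_noc`, the cap of 20 clicks for instances that never reach the target, and the toy IoU curves are all assumptions made for illustration; they are not taken from the paper or its code.

```python
from typing import Sequence


def noc(iou_per_click: Sequence[float],
        target_iou: float = 0.90,
        max_clicks: int = 20) -> int:
    """Number of clicks until IoU first reaches target_iou.

    iou_per_click[i] is the IoU after click i + 1. If the target is never
    reached within max_clicks, the instance is counted as max_clicks
    (assumed convention, see the lead-in).
    """
    for click_idx, iou in enumerate(iou_per_click[:max_clicks], start=1):
        if iou >= target_iou:
            return click_idx
    return max_clicks


def mean_noc(curves: Sequence[Sequence[float]],
             target_iou: float = 0.90,
             max_clicks: int = 20) -> float:
    """Average NoC over a set of instances (one IoU curve per instance)."""
    counts = [noc(curve, target_iou, max_clicks) for curve in curves]
    return sum(counts) / len(counts)


# Toy IoU curves, invented purely for illustration.
toy_curves = [
    [0.62, 0.85, 0.91, 0.95],  # reaches 0.90 on the 3rd click
    [0.93],                    # reaches 0.90 on the 1st click
    [0.40, 0.55, 0.70],        # never reaches 0.90, counted as max_clicks
]

if __name__ == "__main__":
    print(mean_noc(toy_curves))
```

Lower values are better: a model that reaches the target IoU in fewer clicks demands less effort from the annotator. Instances that never reach the target can dominate the average, so the choice of click cap should be reported alongside the score.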