Paper Title


SegViT: Semantic Segmentation with Plain Vision Transformers

Authors

Bowen Zhang, Zhi Tian, Quan Tang, Xiangxiang Chu, Xiaolin Wei, Chunhua Shen, Yifan Liu

Abstract


We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation and propose SegViT. Previous ViT-based segmentation networks usually learn a pixel-level representation from the output of the ViT. Differently, we make use of the fundamental component -- the attention mechanism -- to generate masks for semantic segmentation. Specifically, we propose the Attention-to-Mask (ATM) module, in which the similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to the segmentation masks. Experiments show that our proposed SegViT using the ATM module outperforms its counterparts using the plain ViT backbone on the ADE20K dataset and achieves new state-of-the-art performance on the COCO-Stuff-10K and PASCAL-Context datasets. Furthermore, to reduce the computational cost of the ViT backbone, we propose query-based down-sampling (QD) and query-based up-sampling (QU) to build a Shrunk structure. With the proposed Shrunk structure, the model can save up to $40\%$ of the computation while maintaining competitive performance.
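The core idea of the ATM module, as described in the abstract, can be illustrated with a minimal sketch: similarity maps between learnable class tokens and spatial features are mapped directly to per-class segmentation masks. The function name, shapes, and the use of a scaled dot product followed by a sigmoid are illustrative assumptions here, not the authors' exact implementation.

```python
import numpy as np

def attention_to_mask(class_tokens, features):
    """Illustrative sketch of the Attention-to-Mask (ATM) idea (hypothetical
    helper, not the paper's implementation): token-feature similarity maps
    become per-class mask probabilities.

    class_tokens: (num_classes, dim) learnable class tokens
    features:     (num_positions, dim) spatial feature map, flattened
    returns:      (num_classes, num_positions) mask probabilities in (0, 1)
    """
    dim = features.shape[1]
    # Scaled dot-product similarity between each class token and each position
    sim = class_tokens @ features.T / np.sqrt(dim)
    # Sigmoid turns similarities into per-class, per-position mask probabilities
    return 1.0 / (1.0 + np.exp(-sim))

rng = np.random.default_rng(0)
tokens = rng.standard_normal((3, 8))    # 3 classes, embedding dim 8
feats = rng.standard_normal((16, 8))    # a 4x4 spatial grid, flattened
masks = attention_to_mask(tokens, feats)
print(masks.shape)  # (3, 16): one mask per class over all spatial positions
```

Each row of the output can be reshaped back to the spatial grid (here 4x4) to obtain a dense mask for that class.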
