Paper Title

Max Pooling with Vision Transformers reconciles class and shape in weakly supervised semantic segmentation

Paper Authors

Rossetti, Simone, Zappia, Damiano, Sanzari, Marta, Schaerf, Marco, Pirri, Fiora

Paper Abstract

Weakly Supervised Semantic Segmentation (WSSS) research has explored many directions to improve the typical pipeline of CNN plus class activation maps (CAM) plus refinements, given that image-level class labels are the only supervision. Though the gap with fully supervised methods has narrowed, further reducing it seems unlikely within this framework. On the other hand, WSSS methods based on Vision Transformers (ViT) have not yet explored valid alternatives to CAM. ViT features have been shown to retain scene layout and object boundaries in self-supervised learning. To confirm these findings, we prove that the advantages of transformers in self-supervised methods are further strengthened by Global Max Pooling (GMP), which can leverage patch features to negotiate pixel-label probability with class probability. This work proposes a new WSSS method, dubbed ViT-PCM (ViT Patch-Class Mapping), which is not based on CAM. The proposed end-to-end network learns, with a single optimization process, refined shapes and proper localization for segmentation masks. Our model outperforms the state of the art on baseline pseudo-masks (BPM), achieving $69.3\%$ mIoU on the PascalVOC 2012 $val$ set. We show that our approach has the fewest parameters, yet obtains higher accuracy than all other approaches. In a sentence, the quantitative and qualitative results of our method reveal that ViT-PCM is an excellent alternative to CNN-CAM based architectures.
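The abstract describes Global Max Pooling bridging patch-level and image-level class probabilities: each ViT patch is projected to per-class probabilities, and the image-level score for a class is taken from its most confident patch, so image-level supervision reaches individual patches. The sketch below illustrates that mechanism in plain NumPy; all shapes, names (`patch_features`, `W`), and the random initialization are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

# Hypothetical sizes: s = number of ViT patch tokens, e = embedding
# dimension, c = number of foreground classes (PascalVOC has 20).
rng = np.random.default_rng(0)
s, e, c = 196, 768, 20

patch_features = rng.normal(size=(s, e))   # ViT patch embeddings (assumed given)
W = rng.normal(size=(e, c)) * 0.01         # illustrative patch-to-class projection

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    ex = np.exp(x)
    return ex / ex.sum(axis=axis, keepdims=True)

# Per-patch class probabilities: each patch "votes" over all classes.
patch_class_probs = softmax(patch_features @ W, axis=-1)   # shape (s, c)

# Global Max Pooling over the patch axis: the image-level probability of
# a class is that of its most confident patch, so an image-level loss
# directly supervises patch predictions (which become the pseudo-mask).
image_class_probs = patch_class_probs.max(axis=0)          # shape (c,)

print(image_class_probs.shape)
```

In this framing, thresholding `patch_class_probs` per patch would yield the baseline pseudo-mask, while the max-pooled vector is what the image-level classification loss sees.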
