Paper Title

Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration

Authors

Haotian Bai, Ruimao Zhang, Jiong Wang, Xiang Wan

Abstract

Weakly Supervised Object Localization (WSOL), which aims to localize objects using only image-level labels, has attracted much attention because of its low annotation cost in real applications. Recent studies leverage the long-range dependency captured by self-attention in the Vision Transformer to re-activate semantic regions, aiming to avoid the partial activation common in traditional class activation mapping (CAM). However, long-range modeling in the Transformer neglects the inherent spatial coherence of the object and often diffuses semantic-aware regions far from the object boundary, making the localization results significantly larger or smaller than the object. To address this issue, we introduce a simple yet effective Spatial Calibration Module (SCM) for accurate WSOL, incorporating the semantic similarities of patch tokens and their spatial relationships into a unified diffusion model. Specifically, we introduce a learnable parameter to dynamically balance the semantic correlations and spatial context intensities for effective information propagation. In practice, SCM is designed as an external module of the Transformer and can be removed during inference to reduce the computational cost. The object-sensitive localization ability is implicitly embedded into the Transformer encoder through optimization in the training phase. This enables the generated attention maps to capture sharper object boundaries and filter out object-irrelevant background regions. Extensive experimental results demonstrate the effectiveness of the proposed method, which significantly outperforms its counterpart TS-CAM on both the CUB-200 and ImageNet-1K benchmarks. The code is available at https://github.com/164140757/SCM.
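The core idea described in the abstract, diffusing an attention map over a blend of token-level semantic similarity and patch-grid spatial adjacency, can be sketched as follows. This is a minimal illustrative approximation, not the authors' actual SCM: the function names, the 4-connected grid adjacency, the ReLU-clipped cosine similarity, and the fixed trade-off value `lam` (standing in for the paper's learnable parameter) are all assumptions made for this sketch.

```python
import numpy as np

def semantic_affinity(tokens):
    """ReLU-clipped cosine similarity between patch tokens, shape (N, N)."""
    t = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    return np.maximum(t @ t.T, 0.0)

def spatial_affinity(h, w):
    """1 for 4-connected neighbours on the h x w patch grid, else 0."""
    n = h * w
    A = np.zeros((n, n))
    for i in range(h):
        for j in range(w):
            p = i * w + j
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:
                    A[p, ni * w + nj] = 1.0
    return A

def calibrate(attn, tokens, h, w, lam=0.5, steps=3):
    """Diffuse an attention map over a blend of semantic and spatial
    affinities. `lam` plays the role of the learnable trade-off
    parameter from the paper (fixed here for illustration)."""
    W = lam * semantic_affinity(tokens) + (1.0 - lam) * spatial_affinity(h, w)
    W = W / (W.sum(axis=1, keepdims=True) + 1e-8)  # row-stochastic
    a = attn.copy()
    for _ in range(steps):
        a = W @ a  # one propagation step
    return a
```

Because the blended affinity matrix is row-normalized, each propagation step re-weights a patch's attention by its semantically similar and spatially adjacent neighbours, which pulls the activated region back toward a spatially coherent object and suppresses isolated background responses.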
