Paper Title
Dynamic Focus-aware Positional Queries for Semantic Segmentation

Paper Authors

Haoyu He, Jianfei Cai, Zizheng Pan, Jing Liu, Jing Zhang, Dacheng Tao, Bohan Zhuang

Paper Abstract


DETR-like segmentors have underpinned the most recent breakthroughs in semantic segmentation: they train, end-to-end, a set of queries representing class prototypes or target segments. Recently, masked attention was proposed to restrict each query to attend only to the foreground regions predicted by the preceding decoder block, for easier optimization. Although promising, it relies on learnable parameterized positional queries, which tend to encode dataset statistics and thus yield inaccurate localization for distinct individual queries. In this paper, we propose a simple yet effective query design for semantic segmentation, termed Dynamic Focus-aware Positional Queries (DFPQ), which dynamically generates positional queries conditioned on both the cross-attention scores from the preceding decoder block and the positional encodings of the corresponding image features. Our DFPQ therefore preserves rich localization information for the target segments and provides accurate, fine-grained positional priors. In addition, we propose to handle high-resolution cross-attention efficiently by aggregating contextual tokens based only on the low-resolution cross-attention scores, performing local relation aggregation. Extensive experiments on ADE20K and Cityscapes show that with these two modifications to Mask2Former, our framework achieves SOTA performance, outperforming Mask2Former by clear margins of 1.1%, 1.9%, and 1.1% single-scale mIoU on the ADE20K validation set with ResNet-50, Swin-T, and Swin-B backbones, respectively. Source code is available at https://github.com/ziplab/FASeg
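The core conditioning step in DFPQ can be pictured as forming each query's positional embedding from the image positional encodings, weighted by where that query attended in the preceding decoder block. A minimal numpy sketch of that weighted-sum view follows; the function name, tensor shapes, and the plain matrix-product form are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def dynamic_focus_aware_positional_queries(attn_scores, pos_encodings):
    """Sketch of the DFPQ idea (assumed weighted-sum form).

    attn_scores: (num_queries, num_pixels) cross-attention scores from
        the preceding decoder block; each row is assumed to sum to 1.
    pos_encodings: (num_pixels, dim) positional encodings of the
        corresponding image features.
    Returns: (num_queries, dim) dynamic positional queries, one per query.
    """
    # Each query's positional prior is the attention-weighted average of
    # the positional encodings over the image locations it focused on.
    return attn_scores @ pos_encodings
```

Under this sketch, a query whose attention collapses onto a single foreground pixel inherits exactly that pixel's positional encoding, which is how the design can carry fine-grained localization forward between decoder blocks.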
