Paper Title

Prediction Calibration for Generalized Few-shot Semantic Segmentation

Authors

Lu, Zhihe, He, Sen, Li, Da, Song, Yi-Zhe, Xiang, Tao

Abstract

Generalized Few-shot Semantic Segmentation (GFSS) aims to segment each image pixel into either base classes with abundant training examples or novel classes with only a handful of (e.g., 1-5) training images per class. Compared to the widely studied Few-shot Semantic Segmentation (FSS), which is limited to segmenting novel classes only, GFSS is much under-studied despite being more practical. The existing approach to GFSS is based on classifier parameter fusion, whereby a newly trained novel-class classifier and a pre-trained base-class classifier are combined to form a new classifier. As the training data is dominated by base classes, this approach is inevitably biased towards the base classes. In this work, we propose a novel Prediction Calibration Network (PCN) to address this problem. Instead of fusing the classifier parameters, we fuse the scores produced separately by the base and novel classifiers. To ensure that the fused scores are not biased towards either the base or novel classes, a new Transformer-based calibration module is introduced. It is known that lower-level features are more useful for detecting edge information in an input image than higher-level features. Thus, we build a cross-attention module that guides the classifier's final prediction using the fused multi-level features. However, Transformers are computationally demanding. Crucially, to make the proposed cross-attention module training tractable at the pixel level, this module is designed based on feature-score cross-covariance and is episodically trained to be generalizable at inference time. Extensive experiments on PASCAL-$5^{i}$ and COCO-$20^{i}$ show that our PCN outperforms the state-of-the-art alternatives by large margins.
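To make the two core ideas of the abstract concrete, here is a minimal NumPy sketch of (a) fusing the scores produced separately by the base and novel classifiers rather than their parameters, and (b) re-weighting those scores with attention derived from a feature-score cross-covariance. The function names (`fuse_scores`, `calibrate`) and the single-matrix covariance form are hypothetical simplifications for illustration; the paper's actual module is a learned Transformer trained episodically.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_scores(base_scores, novel_scores):
    """Score fusion: concatenate per-pixel logits from the pre-trained
    base classifier (H, W, C_base) and the newly trained novel classifier
    (H, W, C_novel) into one (H, W, C_base + C_novel) score map."""
    return np.concatenate([base_scores, novel_scores], axis=-1)

def calibrate(scores, feats):
    """Toy calibration step (hypothetical simplification): attention
    weights over classes come from the cross-covariance between pixel
    features and class scores, then re-weight the fused scores."""
    H, W, C = scores.shape
    D = feats.shape[-1]
    s = scores.reshape(-1, C)            # (HW, C) per-pixel class scores
    f = feats.reshape(-1, D)             # (HW, D) per-pixel features
    cov = f.T @ s / s.shape[0]           # (D, C) feature-score cross-covariance
    attn = softmax(f @ cov, axis=-1)     # (HW, C) attention over classes
    out = softmax(s, axis=-1) * attn     # calibrated, unnormalized scores
    out = out / out.sum(axis=-1, keepdims=True)
    return out.reshape(H, W, C)
```

Operating on scores instead of classifier weights keeps the frozen base classifier untouched, which is why the fused prediction can still be debiased afterwards by a calibration module such as the one sketched above.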
