Paper Title
CounTR: Transformer-based Generalised Visual Counting
Paper Authors
Paper Abstract
In this paper, we consider the problem of generalised visual object counting, with the goal of developing a computational model for counting the number of objects from arbitrary semantic categories, using an arbitrary number of "exemplars", i.e. zero-shot or few-shot counting. To this end, we make the following four contributions: (1) we introduce a novel transformer-based architecture for generalised visual object counting, termed Counting Transformer (CounTR), which explicitly captures the similarity between image patches, or between patches and the given "exemplars", via an attention mechanism; (2) we adopt a two-stage training regime that first pre-trains the model with self-supervised learning, followed by supervised fine-tuning; (3) we propose a simple, scalable pipeline for synthesizing training images with a large number of instances, or with instances from different semantic categories, explicitly forcing the model to make use of the given "exemplars"; (4) we conduct thorough ablation studies on the large-scale counting benchmark FSC-147, and demonstrate state-of-the-art performance in both the zero-shot and few-shot settings.
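To make contribution (1) concrete, the sketch below illustrates the core idea of capturing patch-exemplar similarity with attention: image patch tokens act as queries against exemplar tokens, so the attention weights explicitly encode how similar each patch is to the given "exemplars". This is a minimal PyTorch sketch, not the authors' actual CounTR implementation; the module name, feature dimensions, and shapes are all illustrative assumptions.

```python
# Minimal sketch (NOT the paper's code): cross-attention from image patch
# tokens to exemplar tokens, so patch-exemplar similarity is captured
# explicitly in the attention weights.
import torch
import torch.nn as nn

class ExemplarCrossAttention(nn.Module):
    """Hypothetical module: patch features query exemplar features."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patches: torch.Tensor, exemplars: torch.Tensor) -> torch.Tensor:
        # patches:   (B, N, dim) -- N image patch tokens from an image encoder
        # exemplars: (B, K, dim) -- K exemplar tokens from the cropped boxes
        attended, _ = self.attn(query=patches, key=exemplars, value=exemplars)
        return patches + attended  # residual connection, standard in transformers

# Illustrative usage: a 14x14 patch grid attending to 3 exemplar crops.
x = torch.randn(2, 196, 256)
z = torch.randn(2, 3, 256)
out = ExemplarCrossAttention()(x, z)  # (2, 196, 256)
```

In the few-shot setting the exemplar tokens come from the user-provided boxes; in the zero-shot setting no exemplars are given, which is one motivation for the synthetic-image pipeline in contribution (3) that forces the model to actually rely on the exemplars when they are present.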