Paper Title
CounTR: Transformer-based Generalised Visual Counting
Paper Authors
Paper Abstract
In this paper, we consider the problem of generalised visual object counting, with the goal of developing a computational model for counting the number of objects from arbitrary semantic categories, using an arbitrary number of "exemplars", i.e. zero-shot or few-shot counting. To this end, we make the following four contributions: (1) we introduce a novel transformer-based architecture for generalised visual object counting, termed Counting Transformer (CounTR), which explicitly captures the similarity between image patches, or between patches and the given "exemplars", via an attention mechanism; (2) we adopt a two-stage training regime that first pre-trains the model with self-supervised learning, followed by supervised fine-tuning; (3) we propose a simple, scalable pipeline for synthesizing training images with a large number of instances, or with instances from different semantic categories, explicitly forcing the model to make use of the given "exemplars"; (4) we conduct thorough ablation studies on the large-scale counting benchmark FSC-147, and demonstrate state-of-the-art performance in both the zero-shot and few-shot settings.
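To make contribution (1) concrete, the sketch below illustrates the core idea of capturing patch-exemplar similarity with attention: image patch tokens act as queries against exemplar tokens, so the attention weights explicitly encode how similar each patch is to the given "exemplars". This is a minimal PyTorch sketch, not the authors' actual CounTR implementation; the module name, feature dimensions, and shapes are all illustrative assumptions.

```python
# Minimal sketch (NOT the paper's code): cross-attention from image patch
# tokens to exemplar tokens, so patch-exemplar similarity is captured
# explicitly in the attention weights.
import torch
import torch.nn as nn

class ExemplarCrossAttention(nn.Module):
    """Hypothetical module: patch features query exemplar features."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patches: torch.Tensor, exemplars: torch.Tensor) -> torch.Tensor:
        # patches:   (B, N, dim) -- N image patch tokens from an image encoder
        # exemplars: (B, K, dim) -- K exemplar tokens from the cropped boxes
        attended, _ = self.attn(query=patches, key=exemplars, value=exemplars)
        return patches + attended  # residual connection, standard in transformers

# Illustrative usage: a 14x14 patch grid attending to 3 exemplar crops.
x = torch.randn(2, 196, 256)
z = torch.randn(2, 3, 256)
out = ExemplarCrossAttention()(x, z)  # (2, 196, 256)
```

In the few-shot setting the exemplar tokens come from the user-provided boxes; in the zero-shot setting no exemplars are given, which is one motivation for the synthetic-image pipeline in contribution (3) that forces the model to actually rely on the exemplars when they are present.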