Paper Title
CF-ViT: A General Coarse-to-Fine Method for Vision Transformer
Paper Authors
Paper Abstract
Vision Transformers (ViT) have achieved many breakthroughs in computer vision tasks. However, considerable redundancy arises in the spatial dimension of an input image, leading to massive computational costs. In this paper, we therefore propose a coarse-to-fine vision transformer (CF-ViT) that relieves the computational burden while retaining performance. Our proposed CF-ViT is motivated by two important observations in modern ViT models: (1) coarse-grained patch splitting can locate the informative regions of an input image; (2) most images can be well recognized by a ViT model from a short token sequence. Accordingly, CF-ViT performs network inference in two stages. At the coarse inference stage, an input image is split into a short patch sequence for computationally economical classification. If the image is not well recognized, its informative patches are identified and further re-split at a finer granularity. Extensive experiments demonstrate the efficacy of CF-ViT. For example, without any compromise in performance, CF-ViT reduces the FLOPs of LV-ViT by 53% and achieves a 2.01x improvement in throughput.
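The two-stage inference described above can be sketched as a simple confidence gate: run a cheap coarse-stage classifier first, and invoke the fine stage only when the coarse prediction is not confident enough. The function below is a minimal illustrative sketch; the `coarse_fn`/`fine_fn` stand-ins and the confidence threshold are assumptions for illustration, not the authors' actual implementation or API.

```python
import numpy as np

def two_stage_inference(coarse_fn, fine_fn, image, threshold=0.7):
    """Hypothetical sketch of CF-ViT-style coarse-to-fine gating.

    coarse_fn: classifies the image from a short (coarse-patch) token
               sequence and returns class probabilities (cheap pass).
    fine_fn:   re-splits informative patches at a finer granularity and
               classifies again (invoked only when the coarse stage is
               not confident enough).
    """
    probs = coarse_fn(image)              # stage 1: economical coarse pass
    if probs.max() >= threshold:          # confident enough: stop early
        return int(probs.argmax()), "coarse"
    probs = fine_fn(image)                # stage 2: fine-grained re-inference
    return int(probs.argmax()), "fine"

# Toy stand-ins for the two stages (illustrative only):
easy = lambda img: np.array([0.05, 0.90, 0.05])  # confident coarse result
hard = lambda img: np.array([0.40, 0.35, 0.25])  # ambiguous coarse result
fine = lambda img: np.array([0.10, 0.10, 0.80])  # fine stage resolves it

print(two_stage_inference(easy, fine, None))  # (1, 'coarse')
print(two_stage_inference(hard, fine, None))  # (2, 'fine')
```

Because most images exit at the coarse stage, the expensive fine-grained pass is paid only for hard inputs, which is the source of the reported FLOPs and throughput savings.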