Paper Title
Grafting Vision Transformers
Paper Authors
Paper Abstract
Vision Transformers (ViTs) have recently become the state-of-the-art across many computer vision tasks. In contrast to convolutional networks (CNNs), ViTs enable global information sharing even within shallow layers of a network, i.e., among high-resolution features. However, this perk was later overlooked with the success of pyramid architectures such as Swin Transformer, which show better performance-complexity trade-offs. In this paper, we present a simple and efficient add-on component (termed GrafT) that considers global dependencies and multi-scale information throughout the network, in both high- and low-resolution features alike. It has the flexibility of branching out at arbitrary depths and shares most of the parameters and computations of the backbone. GrafT shows consistent gains over various well-known models, including both hybrid and pure Transformer types, both homogeneous and pyramid structures, and various self-attention methods. In particular, it largely benefits mobile-size models by providing high-level semantics. On the ImageNet-1k dataset, GrafT delivers +3.9%, +1.4%, and +1.9% top-1 accuracy improvements to DeiT-T, Swin-T, and MobileViT-XXS, respectively. Our code and models will be made available.
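The abstract describes GrafT only at a high level: an add-on branch that is grafted onto an existing backbone at some depth, views the features globally at a coarser scale, and shares the backbone's parameters. The sketch below is a minimal illustration of that general idea in PyTorch; the class name, pooling-based down-sampling, additive fusion, and all hyperparameters are assumptions, not the paper's actual design.

```python
# Hypothetical sketch of a GrafT-style global branch (illustration only; the
# real GrafT architecture, naming, and fusion rule are not given in the abstract).
import torch
import torch.nn as nn

class GlobalBranch(nn.Module):
    """Pools high-resolution tokens to a coarse grid, applies global
    self-attention there, and fuses the result back (assumed additive fusion)."""
    def __init__(self, dim: int, num_heads: int = 4, pool: int = 4):
        super().__init__()
        self.p = pool
        self.pool = nn.AvgPool2d(pool)   # down-sample H x W -> H/p x W/p
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.up = nn.Upsample(scale_factor=pool, mode="nearest")

    def forward(self, x):                # x: (B, C, H, W) backbone feature map
        b, c, h, w = x.shape
        g = self.pool(x)                 # coarse, global view of the feature map
        tokens = g.flatten(2).transpose(1, 2)          # (B, N, C) token sequence
        tokens, _ = self.attn(tokens, tokens, tokens)  # global self-attention
        g = tokens.transpose(1, 2).reshape(b, c, h // self.p, w // self.p)
        return x + self.up(g)            # fuse the global context back in

# Usage: attach the branch at an intermediate stage of any backbone.
feat = torch.randn(2, 96, 56, 56)        # e.g., a stage-1 feature map of a tiny model
out = GlobalBranch(dim=96)(feat)
print(out.shape)                         # torch.Size([2, 96, 56, 56])
```

Because the branch reuses the backbone's spatial features and only adds a lightweight global path, it can in principle be attached at arbitrary depths, which is the flexibility the abstract highlights.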