Paper Title
Global Context Vision Transformers
Paper Authors
Paper Abstract
We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision. Our method leverages global context self-attention modules, used jointly with standard local self-attention, to effectively and efficiently model both long- and short-range spatial interactions, without the need for expensive operations such as computing attention masks or shifting local windows. In addition, we address the lack of inductive bias in ViTs and propose to leverage modified fused inverted residual blocks in our architecture. Our proposed GC ViT achieves state-of-the-art results across image classification, object detection, and semantic segmentation tasks. On the ImageNet-1K dataset for classification, the variants of GC ViT with 51M, 90M, and 201M parameters achieve 84.3%, 85.0%, and 85.7% Top-1 accuracy, respectively, at 224×224 image resolution and without any pre-training, hence surpassing comparably-sized prior art such as the CNN-based ConvNeXt and the ViT-based MaxViT and Swin Transformer by a large margin. Pre-trained GC ViT backbones consistently outperform prior work on the downstream tasks of object detection, instance segmentation, and semantic segmentation using the MS COCO and ADE20K datasets. Specifically, GC ViT with a 4-scale DINO detection head achieves a box AP of 58.3 on the MS COCO dataset.
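To make the core mechanism concrete, below is a minimal PyTorch sketch (not the authors' code) of the idea described in the abstract: local self-attention within non-overlapping windows, alternated with "global" self-attention in which every window attends with a shared set of query tokens summarizing the whole feature map. The names WindowAttention, GlobalQueryGen, and window_partition, and all hyperparameters, are illustrative assumptions; the paper's global query generator uses fused inverted residual (fused-MBConv) downsampling blocks, simplified here to average pooling.

import torch
import torch.nn as nn

def window_partition(x, w):
    # (B, H, W, C) -> (B * num_windows, w*w, C); windows of each image stay contiguous
    B, H, W, C = x.shape
    x = x.view(B, H // w, w, W // w, w, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)

class WindowAttention(nn.Module):
    """Self-attention over window tokens. If q_global is given, the locally
    computed queries are replaced by the shared per-image global queries,
    so each window attends using a summary of the entire feature map."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, q_global=None):
        B, N, C = x.shape  # B = images * windows
        h = self.num_heads
        qkv = self.qkv(x).reshape(B, N, 3, h, C // h).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        if q_global is not None:
            # Broadcast the same global queries to every window of an image.
            q = q_global.repeat_interleave(B // q_global.shape[0], dim=0)
        attn = (q * self.scale) @ k.transpose(-2, -1)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

class GlobalQueryGen(nn.Module):
    """Summarizes the full feature map into one window-sized set of query
    tokens. Average pooling stands in for the paper's fused-MBConv stack."""
    def __init__(self, num_heads, window):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(window)
        self.num_heads = num_heads

    def forward(self, x):  # x: (B, C, H, W)
        B, C, _, _ = x.shape
        q = self.pool(x).flatten(2).transpose(1, 2)          # (B, w*w, C)
        h = self.num_heads
        return q.reshape(B, -1, h, C // h).transpose(1, 2)   # (B, h, w*w, C//h)

# Usage: alternate short-range (local) and long-range (global) mixing.
B, H, W, C, w, heads = 2, 8, 8, 64, 4, 4
x = torch.randn(B, H, W, C)
local_attn, global_attn = WindowAttention(C, heads), WindowAttention(C, heads)
q_gen = GlobalQueryGen(heads, w)

windows = window_partition(x, w)               # (B*4, 16, C)
y_local = local_attn(windows)                  # local window self-attention
q_g = q_gen(x.permute(0, 3, 1, 2))             # (B, heads, 16, 16)
y_global = global_attn(windows, q_global=q_g)  # global-query attention

Because the global queries are precomputed once per image and reused across all windows, long-range interactions are captured without attention masks or shifted windows, which is the efficiency argument made in the abstract.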