MaxViT: Multi-Axis Vision Transformer

Paper Title

MaxViT: Multi-Axis Vision Transformer

Authors

Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, Yinxiao Li

Abstract

Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. We also present a new architectural element by effectively blending our proposed attention model with convolutions, and accordingly propose a simple hierarchical vision backbone, dubbed MaxViT, by simply repeating the basic building block over multiple stages. Notably, MaxViT is able to "see" globally throughout the entire network, even in earlier, high-resolution stages. We demonstrate the effectiveness of our model on a broad spectrum of vision tasks. On image classification, MaxViT achieves state-of-the-art performance under various settings: without extra data, MaxViT attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a backbone delivers favorable performance on object detection as well as visual aesthetic assessment. We also show that our proposed model expresses strong generative modeling capability on ImageNet, demonstrating the superior potential of MaxViT blocks as a universal vision module. The source code and trained models will be available at https://github.com/google-research/maxvit.
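
The abstract names the two halves of multi-axis attention without showing how tokens are grouped. The sketch below is an illustration only, not the authors' implementation: it assumes that "blocked local" attention means non-overlapping square windows and that "dilated global" attention means a fixed-size set of tokens sampled at a uniform stride across the whole image. The function names and the parameters `p` and `g` are hypothetical, chosen for this example.

```python
def block_groups(height, width, p):
    """Local axis (assumed form): non-overlapping p x p windows."""
    groups = {}
    for y in range(height):
        for x in range(width):
            # Pixels sharing the same window index attend to each other.
            groups.setdefault((y // p, x // p), []).append((y, x))
    return list(groups.values())


def grid_groups(height, width, g):
    """Global axis (assumed form): g x g tokens sampled at a uniform stride."""
    stride_y, stride_x = height // g, width // g
    groups = {}
    for y in range(height):
        for x in range(width):
            # Pixels sharing the same offset within a stride cell attend to
            # each other, so every group spans the full image sparsely.
            groups.setdefault((y % stride_y, x % stride_x), []).append((y, x))
    return list(groups.values())


def attention_pairs(groups):
    """Count query-key pairs when attention runs inside each group only."""
    return sum(len(group) ** 2 for group in groups)
```

With `p = 4`, an 8 x 8 input gives 4 windows of 16 tokens (1,024 pairs), and a 16 x 16 input gives 16 windows (4,096 pairs): four times the pixels costs four times the pairs, which is the linear scaling the abstract refers to. Full self-attention over the same inputs would need 4,096 and 65,536 pairs respectively. Each group from `grid_groups` has the same fixed size but reaches from one corner of the image to the other, which is one way to read the claim that the model can "see" globally even at high resolution.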
