Global Self-Attention Networks for Image Recognition

Paper title

Global Self-Attention Networks for Image Recognition

Authors

Zhuoran Shen, Irwan Bello, Raviteja Vemulapalli, Xuhui Jia, Ching-Hui Chen

Abstract

Recently, a series of works in computer vision have shown promising results on various image and video understanding tasks using self-attention. However, due to the quadratic computational and memory complexities of self-attention, these works either apply attention only to low-resolution feature maps in later stages of a deep network or restrict the receptive field of attention in each layer to a small local region. To overcome these limitations, this work introduces a new global self-attention module, referred to as the GSA module, which is efficient enough to serve as the backbone component of a deep network. This module consists of two parallel layers: a content attention layer that attends to pixels based only on their content and a positional attention layer that attends to pixels based on their spatial locations. The output of this module is the sum of the outputs of the two layers. Based on the proposed GSA module, we introduce new standalone global attention-based deep networks that use GSA modules instead of convolutions to model pixel interactions. Due to the global extent of the proposed GSA module, a GSA network has the ability to model long-range pixel interactions throughout the network. Our experimental results show that GSA networks significantly outperform the corresponding convolution-based networks on the CIFAR-100 and ImageNet datasets while using fewer parameters and computations. The proposed GSA networks also outperform various existing attention-based networks on the ImageNet dataset.
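The two-layer structure described in the abstract can be illustrated with a small NumPy sketch. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not give the exact form of either layer. The content layer below assumes a key-normalized factorization that avoids building an N x N attention matrix, and the positional layer assumes axial (column, then row) attention with learned relative-offset embeddings and a softmax. All function and parameter names (content_attention, axial_positional_attention, gsa_module, w_q, w_k, w_v, rel_col, rel_row) are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def content_attention(q, k, v):
    """Global content attention over N pixels without an N x N matrix.

    q, k: (N, d_k); v: (N, d_v). Assumed form: normalize each key channel
    over all pixels, summarize the values into a (d_k, d_v) context, then
    let every query read from that context.
    """
    k_norm = softmax(k, axis=0)  # each key channel sums to 1 over the N pixels
    context = k_norm.T @ v       # (d_k, d_v)
    return q @ context           # (N, d_v)


def axial_positional_attention(q, v, rel_emb, axis):
    """Attention whose weights depend on the query and the relative offset only.

    q: (H, W, d_k); v: (H, W, d_v); rel_emb: (2L - 1, d_k), L = q.shape[axis].
    axis=0 attends along each column, axis=1 along each row (assumed design).
    """
    L = q.shape[axis]
    # offsets[i, j] indexes the embedding for relative position j - i.
    offsets = np.arange(L)[None, :] - np.arange(L)[:, None] + L - 1  # (L, L)
    r = rel_emb[offsets]                                             # (L, L, d_k)
    if axis == 0:
        weights = softmax(np.einsum("iwd,ijd->iwj", q, r), axis=-1)
        return np.einsum("iwj,jwe->iwe", weights, v)
    weights = softmax(np.einsum("hid,ijd->hij", q, r), axis=-1)
    return np.einsum("hij,hje->hie", weights, v)


def gsa_module(x, w_q, w_k, w_v, rel_col, rel_row):
    """Sum of a content layer and a positional layer, as the abstract states."""
    h, w, _ = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    content = content_attention(
        q.reshape(h * w, -1), k.reshape(h * w, -1), v.reshape(h * w, -1)
    ).reshape(h, w, -1)
    positional = axial_positional_attention(q, v, rel_col, axis=0)
    positional = axial_positional_attention(q, positional, rel_row, axis=1)
    return content + positional


# Toy usage with random weights (illustrative sizes only).
rng = np.random.default_rng(0)
H, W, D_IN, D_K, D_V = 6, 5, 8, 4, 8
x = rng.standard_normal((H, W, D_IN))
out = gsa_module(
    x,
    rng.standard_normal((D_IN, D_K)),
    rng.standard_normal((D_IN, D_K)),
    rng.standard_normal((D_IN, D_V)),
    rng.standard_normal((2 * H - 1, D_K)),
    rng.standard_normal((2 * W - 1, D_K)),
)
print(out.shape)  # (6, 5, 8)
```

In this sketch the content layer costs O(N · d_k · d_v) for N = H · W pixels, because the keys and values are first reduced to a small context matrix. That is one plausible way to obtain the efficiency the abstract claims. The positional layer reaches every pixel by chaining a column pass and a row pass.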

Paper title