Paper Title

Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions

Paper Authors

Rui-Yang Ju, Ting-Yu Lin, Jen-Shiun Chiang, Jia-Hao Jian, Yu-Shian Lin, Liu-Rui-Yi Huang

Paper Abstract

Following the achievements of Transformer in natural language processing, its encoder-decoder structure and attention mechanism have been applied to computer vision. Recently, across multiple computer vision tasks (image classification, object detection, semantic segmentation, etc.), state-of-the-art convolutional neural networks have introduced concepts from Transformer, demonstrating that Transformer has good prospects in image recognition. After Vision Transformer was proposed, more and more works began to use self-attention to completely replace convolutional layers. This work builds on Vision Transformer and combines it with a pyramid architecture, using the split-transform-merge strategy to propose a group encoder; we name the resulting network architecture the Aggregated Pyramid Vision Transformer (APVT). We perform image classification on the CIFAR-10 dataset and object detection on the COCO 2017 dataset. Compared with other network architectures that use Transformer as the backbone, APVT achieves excellent results while reducing the computational cost. We hope this improved strategy can provide a reference for future Transformer research in computer vision.
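The abstract describes a group encoder built with a split-transform-merge strategy on top of a pyramid Vision Transformer backbone. Below is a minimal illustrative sketch in PyTorch, assuming the split is taken along the embedding dimension, each group is transformed by its own Transformer encoder layer, and the merge is a concatenation followed by a linear projection. The group count, head count, and merge rule here are hypothetical choices for illustration, not the paper's exact design.

# Minimal sketch of a split-transform-merge "group encoder" block.
# All hyperparameters (num_groups, num_heads, dim_feedforward) are
# illustrative assumptions; the abstract does not specify them.
import torch
import torch.nn as nn


class GroupEncoder(nn.Module):
    def __init__(self, embed_dim: int, num_groups: int = 4, num_heads: int = 2):
        super().__init__()
        assert embed_dim % num_groups == 0, "embed_dim must divide evenly into groups"
        group_dim = embed_dim // num_groups
        # "Transform": one independent Transformer encoder layer per group.
        self.branches = nn.ModuleList(
            nn.TransformerEncoderLayer(
                d_model=group_dim,
                nhead=num_heads,
                dim_feedforward=group_dim * 4,
                batch_first=True,
            )
            for _ in range(num_groups)
        )
        # "Merge": fuse the concatenated group outputs back to embed_dim.
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.num_groups = num_groups

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, embed_dim)
        # "Split": divide the embedding dimension into equal groups.
        chunks = torch.chunk(x, self.num_groups, dim=-1)
        # "Transform": process each group with its own encoder branch.
        outs = [branch(chunk) for branch, chunk in zip(self.branches, chunks)]
        # "Merge": concatenate the group outputs and project.
        return self.proj(torch.cat(outs, dim=-1))


if __name__ == "__main__":
    # Example: 14x14 patch tokens with a 256-dim embedding (hypothetical sizes).
    tokens = torch.randn(2, 196, 256)
    block = GroupEncoder(embed_dim=256, num_groups=4, num_heads=2)
    print(block(tokens).shape)  # torch.Size([2, 196, 256])

In a pyramid-style backbone, one would typically stack such blocks in stages with progressively smaller token counts and larger embedding dimensions; those stage details are likewise not given in this abstract.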
