Paper Title
Lightweight Vision Transformer with Cross Feature Attention
Paper Authors
Paper Abstract
Recent advances in vision transformers (ViTs) have achieved great performance in visual recognition tasks. Convolutional neural networks (CNNs) exploit spatial inductive bias to learn visual representations, but these networks are spatially local. ViTs can learn global representations with their self-attention mechanism, but they are usually heavyweight and unsuitable for mobile devices. In this paper, we propose cross feature attention (XFA) to bring down the computation cost of transformers, and combine it with efficient mobile CNNs to form a novel, efficient, lightweight CNN-ViT hybrid model, XFormer, which can serve as a general-purpose backbone to learn both global and local representations. Experimental results show that XFormer outperforms numerous CNN- and ViT-based models across different tasks and datasets. On the ImageNet1K dataset, XFormer achieves a top-1 accuracy of 78.5% with 5.5 million parameters, which is 2.2% and 6.3% more accurate than EfficientNet-B0 (CNN-based) and DeiT (ViT-based) with a similar number of parameters. Our model also performs well when transferred to object detection and semantic segmentation tasks. On the MS COCO dataset, XFormer exceeds MobileNetV2 by 10.5 AP (22.7 -> 33.2 AP) in the YOLOv3 framework with only 6.3M parameters and 3.8G FLOPs. On the Cityscapes dataset, with only a simple all-MLP decoder, XFormer achieves an mIoU of 78.5 and an FPS of 15.3, surpassing state-of-the-art lightweight segmentation networks.
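To make the computational motivation concrete, the sketch below shows one common way to cut self-attention cost: computing attention across the feature (channel) dimension instead of across tokens, which replaces the quadratic N x N token-to-token map with a small d x d feature map. This is only an illustrative PyTorch sketch of that general idea, not the paper's actual XFA formulation; the module name FeatureAttention, the head count, and the normalization choice are assumptions made for the example.

    # Illustrative sketch only: attention over the feature (channel) dimension,
    # shown to explain why such schemes are cheaper than standard self-attention.
    # NOT the paper's exact XFA design; names and hyperparameters are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureAttention(nn.Module):
        """Attention computed across features instead of tokens.

        Standard self-attention builds an N x N token map, costing O(N^2 * d).
        Attending over features builds a d x d map, costing O(N * d^2), which is
        cheaper when the token count N is much larger than the embedding size d.
        """

        def __init__(self, dim: int, num_heads: int = 4):
            super().__init__()
            assert dim % num_heads == 0
            self.num_heads = num_heads
            self.qkv = nn.Linear(dim, dim * 3, bias=False)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch B, tokens N, channels C)
            B, N, C = x.shape
            qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
            q, k, v = qkv.permute(2, 0, 3, 4, 1)  # each: (B, heads, head_dim, N)

            # L2-normalize along the token axis so the d x d similarity is well scaled.
            q = F.normalize(q, dim=-1)
            k = F.normalize(k, dim=-1)

            attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)  # (B, heads, head_dim, head_dim)
            out = attn @ v                                     # (B, heads, head_dim, N)
            out = out.permute(0, 3, 1, 2).reshape(B, N, C)
            return self.proj(out)

    if __name__ == "__main__":
        x = torch.randn(2, 196, 64)       # e.g. 14x14 patch tokens, 64-dim embeddings
        y = FeatureAttention(dim=64)(x)
        print(y.shape)                    # torch.Size([2, 196, 64])

Because the attention map here is d x d rather than N x N, its cost grows only linearly with the number of tokens, which is the kind of saving that makes transformer blocks practical in a mobile-oriented hybrid backbone.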