Paper Title

DMFormer: Closing the Gap Between CNN and Vision Transformers

Authors

Zimian Wei, Hengyue Pan, Lujun Li, Menglong Lu, Xin Niu, Peijie Dong, Dongsheng Li

Abstract

Vision transformers have shown excellent performance in computer vision tasks. As the computation cost of their self-attention mechanism is expensive, recent works have tried to replace the self-attention mechanism in vision transformers with convolutional operations, which are more efficient thanks to their built-in inductive bias. However, these efforts either ignore multi-level features or lack dynamic properties, leading to sub-optimal performance. In this paper, we propose a Dynamic Multi-level Attention mechanism (DMA), which captures different patterns of input images with multiple kernel sizes and enables input-adaptive weights via a gating mechanism. Based on DMA, we present an efficient backbone network named DMFormer. DMFormer adopts the overall architecture of vision transformers while replacing the self-attention mechanism with our proposed DMA. Extensive experimental results on the ImageNet-1K and ADE20K datasets demonstrate that DMFormer achieves state-of-the-art performance, outperforming similar-sized vision transformers (ViTs) and convolutional neural networks (CNNs).
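Since the abstract describes DMA only at a high level, below is a minimal PyTorch sketch of the idea as stated: depthwise convolutions with several kernel sizes capture multi-level patterns, and a gating branch produces input-adaptive weights that fuse them. The module name `DynamicMultiLevelAttention`, the kernel sizes, and the gating layout are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a dynamic multi-level
# attention block: several depthwise convolutions capture patterns at
# different scales, and a gating branch mixes them with input-adaptive
# weights, replacing self-attention in a ViT-style block.
import torch
import torch.nn as nn

class DynamicMultiLevelAttention(nn.Module):  # hypothetical name
    def __init__(self, dim, kernel_sizes=(3, 5, 7)):  # assumed kernel sizes
        super().__init__()
        # One depthwise conv per kernel size; padding preserves H and W.
        self.branches = nn.ModuleList(
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)
            for k in kernel_sizes
        )
        # Gating: global average pooling -> per-branch logits.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, len(kernel_sizes), 1),
        )
        self.proj = nn.Conv2d(dim, dim, 1)  # pointwise output projection

    def forward(self, x):  # x: (B, C, H, W)
        # Stack branch outputs: (B, K, C, H, W).
        feats = torch.stack([b(x) for b in self.branches], dim=1)
        # Input-adaptive weights over the K branches: (B, K, 1, 1, 1).
        weights = self.gate(x).softmax(dim=1).unsqueeze(2)
        out = (feats * weights).sum(dim=1)  # weighted multi-level fusion
        return self.proj(out)

# Usage: drop-in replacement for the self-attention sub-layer.
x = torch.randn(2, 64, 14, 14)
print(DynamicMultiLevelAttention(64)(x).shape)  # torch.Size([2, 64, 14, 14])
```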
