Paper Title

Adaptive Split-Fusion Transformer

Paper Authors

Zixuan Su, Hao Zhang, Jingjing Chen, Lei Pang, Chong-Wah Ngo, Yu-Gang Jiang

Abstract

Neural networks for visual content understanding have recently evolved from convolutional ones (CNNs) to transformers. The former (CNNs) rely on small-windowed kernels to capture regional clues, demonstrating solid local expressiveness. The latter (transformers), on the contrary, establish long-range global connections between localities for holistic learning. Inspired by this complementary nature, there is growing interest in designing hybrid models that best utilize each technique. Current hybrids merely use convolutions as simple approximations of the linear projection, or juxtapose a convolution branch with attention, without considering the relative importance of local/global modeling. To tackle this, we propose a new hybrid named Adaptive Split-Fusion Transformer (ASF-former) that treats the convolutional and attention branches differently, with adaptive weights. Specifically, an ASF-former encoder splits the feature channels equally in half to fit the dual-path inputs. The outputs of the dual paths are then fused with weighting scalars calculated from visual cues. We also design the convolutional path compactly for efficiency. Extensive experiments on standard benchmarks such as ImageNet-1K, CIFAR-10, and CIFAR-100 show that our ASF-former outperforms its CNN and transformer counterparts, as well as hybrid pilots, in terms of accuracy (83.9% on ImageNet-1K) under similar conditions (12.9G MACs/56.7M Params, without large-scale pre-training). The code is available at: https://github.com/szx503045266/ASF-former.
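The abstract's core mechanism (split channels in half, process each half on a convolutional or attention path, then fuse with an adaptive scalar computed from visual cues) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `asf_fusion_sketch`, the paths passed in as callables, and the sigmoid-over-pooled-features weighting are all illustrative assumptions.

```python
import numpy as np

def asf_fusion_sketch(x, conv_path, attn_path, w_proj):
    """Illustrative sketch of the split-fusion idea from the abstract.

    x         : (N, C) token features for one encoder block
    conv_path : callable for the local (convolutional) branch
    attn_path : callable for the global (attention) branch
    w_proj    : (C,) projection used to turn pooled cues into a scalar
                (hypothetical choice of how "visual cues" become a weight)
    """
    C = x.shape[1]
    # Split the feature channels equally in half, one half per path.
    x_conv, x_attn = x[:, : C // 2], x[:, C // 2:]
    y_conv = conv_path(x_conv)   # local branch output
    y_attn = attn_path(x_attn)   # global branch output
    # Pool the dual-path outputs into a "visual cue" vector, then map it
    # to an adaptive weight alpha in (0, 1) via a sigmoid.
    cue = np.concatenate([y_conv, y_attn], axis=1).mean(axis=0)
    alpha = 1.0 / (1.0 + np.exp(-(cue @ w_proj)))
    # Fuse: the scalar trades off local vs. global contributions.
    return np.concatenate([alpha * y_conv, (1.0 - alpha) * y_attn], axis=1)
```

Because `alpha` depends on the input content rather than being a fixed hyperparameter, each image can lean more on the local or the global branch, which is the "adaptive" part of the design.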
