Paper Title
Aggregating Global Features into Local Vision Transformer
Paper Authors
Paper Abstract
Local Transformer-based classification models have recently achieved promising results at relatively low computational cost. However, the effect of aggregating spatially global information in local Transformer-based architectures remains unclear. This work investigates the outcome of applying a global attention-based module, named multi-resolution overlapped attention (MOA), after each stage of a local window-based Transformer. The proposed MOA employs slightly larger, overlapped patches in the key to enable neighborhood pixel information transmission, which leads to a significant performance gain. In addition, we thoroughly investigate the effect of the dimensions of essential architecture components through extensive experiments and discover an optimal architecture design. Extensive experimental results on the CIFAR-10, CIFAR-100, and ImageNet-1K datasets demonstrate that the proposed approach outperforms previous vision Transformers with comparatively fewer parameters.
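To make the mechanism described in the abstract concrete, the following is a minimal PyTorch sketch of the MOA idea: queries come from non-overlapping patches of a stage's feature map, while keys and values come from slightly larger, overlapping patches so that neighboring-pixel information crosses window boundaries. The class name `MOASketch`, the patch sizes, strides, and projection dimensions are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of multi-resolution overlapped attention (MOA),
# assuming a PyTorch setting; all sizes below are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MOASketch(nn.Module):
    """Global attention whose queries come from non-overlapping patches
    and whose keys/values come from slightly larger, overlapping patches."""

    def __init__(self, dim, q_patch=14, k_patch=16, k_stride=14):
        super().__init__()
        self.q_patch, self.k_patch, self.k_stride = q_patch, k_patch, k_stride
        self.q_proj = nn.Linear(dim * q_patch * q_patch, dim)
        self.k_proj = nn.Linear(dim * k_patch * k_patch, dim)
        self.v_proj = nn.Linear(dim * k_patch * k_patch, dim)
        self.scale = dim ** -0.5

    def forward(self, x):
        # x: (B, C, H, W) feature map produced after a Transformer stage.
        B, C, H, W = x.shape
        # Non-overlapping query patches: stride equals the patch size.
        q = F.unfold(x, kernel_size=self.q_patch, stride=self.q_patch)   # (B, C*p*p, Nq)
        q = self.q_proj(q.transpose(1, 2))                               # (B, Nq, dim)
        # Slightly larger, overlapping key/value patches: stride < kernel size.
        pad = (self.k_patch - self.k_stride) // 2
        kv = F.unfold(x, kernel_size=self.k_patch, stride=self.k_stride, padding=pad)
        kv = kv.transpose(1, 2)                                          # (B, Nk, C*k*k)
        k, v = self.k_proj(kv), self.v_proj(kv)                          # (B, Nk, dim)
        # Standard scaled dot-product attention across all patch tokens;
        # the output would then be fed back to the local windows of the next stage.
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        return attn @ v                                                  # (B, Nq, dim)


# Example usage on a 56x56 feature map with 96 channels (assumed sizes).
if __name__ == "__main__":
    x = torch.randn(1, 96, 56, 56)
    print(MOASketch(dim=96)(x).shape)  # torch.Size([1, 16, 96])
```

The overlap is what distinguishes this from plain patch-wise global attention: because each key/value patch extends slightly beyond its corresponding query patch, every global token already carries information from its neighbors, which is the property the abstract credits for the performance gain.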