Paper Title
Fast Vision Transformers with HiLo Attention
Paper Authors
Paper Abstract
Vision Transformers (ViTs) have triggered the most recent and significant breakthroughs in computer vision. Their efficient designs are mostly guided by the indirect metric of computational complexity, i.e., FLOPs, which, however, has a clear gap with direct metrics such as throughput. We therefore propose to use direct speed evaluation on the target platform as the design principle for efficient ViTs. In particular, we introduce LITv2, a simple and effective ViT that performs favourably against existing state-of-the-art methods across a spectrum of model sizes while running faster. At the core of LITv2 is a novel self-attention mechanism, which we dub HiLo. HiLo is inspired by the insight that high frequencies in an image capture local fine details and low frequencies capture global structures, whereas a standard multi-head self-attention layer ignores this distinction between frequencies. To exploit it, we disentangle the high- and low-frequency patterns in an attention layer by separating the heads into two groups: one group encodes high frequencies via self-attention within each local window, while the other encodes low frequencies via global attention between the average-pooled low-frequency keys and values from each window and every query position in the input feature map. Benefiting from the efficient design of both groups, we show that HiLo is superior to existing attention mechanisms by comprehensively benchmarking FLOPs, speed and memory consumption on GPUs and CPUs. For example, HiLo is 1.4x faster than spatial reduction attention and 1.6x faster than local window attention on CPUs. Powered by HiLo, LITv2 serves as a strong backbone for mainstream vision tasks, including image classification, dense detection and segmentation. Code is available at https://github.com/ziplab/LITv2.
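The abstract describes HiLo at a high level; the PyTorch sketch below makes the head-splitting scheme concrete. It is a minimal illustration under stated assumptions, not the official implementation: the class name `HiLoAttention`, the `alpha` parameter (assumed here to be the fraction of heads assigned to the low-frequency path), the default `window_size=2`, and the `(B, H, W, C)` tensor layout are all illustrative choices; the authors' actual code is in the repository linked above.

```python
# Minimal sketch of HiLo attention as described in the abstract.
# Assumes H and W are divisible by window_size and that `alpha`
# leaves both head groups non-empty.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HiLoAttention(nn.Module):
    def __init__(self, dim, num_heads=8, window_size=2, alpha=0.5):
        super().__init__()
        head_dim = dim // num_heads
        self.ws = window_size
        self.scale = head_dim ** -0.5
        # `alpha` (assumed name): fraction of heads given to the
        # low-frequency (global) path; the rest handle high frequencies.
        self.l_heads = int(num_heads * alpha)
        self.h_heads = num_heads - self.l_heads
        self.l_dim = self.l_heads * head_dim
        self.h_dim = self.h_heads * head_dim
        self.h_qkv = nn.Linear(dim, self.h_dim * 3)
        self.h_proj = nn.Linear(self.h_dim, self.h_dim)
        self.l_q = nn.Linear(dim, self.l_dim)
        self.l_kv = nn.Linear(dim, self.l_dim * 2)
        self.l_proj = nn.Linear(self.l_dim, self.l_dim)

    def hifi(self, x):
        # High-frequency path: plain self-attention inside each ws x ws window.
        B, H, W, C = x.shape
        hg, wg = H // self.ws, W // self.ws
        x = x.reshape(B, hg, self.ws, wg, self.ws, C).transpose(2, 3)
        x = x.reshape(B, hg * wg, self.ws * self.ws, C)
        qkv = (self.h_qkv(x)
               .reshape(B, hg * wg, self.ws * self.ws, 3, self.h_heads, -1)
               .permute(3, 0, 1, 4, 2, 5))
        q, k, v = qkv[0], qkv[1], qkv[2]
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(2, 3)
        out = out.reshape(B, hg, wg, self.ws, self.ws, self.h_dim)
        out = out.transpose(2, 3).reshape(B, H, W, self.h_dim)
        return self.h_proj(out)

    def lofi(self, x):
        # Low-frequency path: every query position attends globally to keys
        # and values computed from the average-pooled (one token per window) map.
        B, H, W, C = x.shape
        q = self.l_q(x).reshape(B, H * W, self.l_heads, -1).transpose(1, 2)
        pooled = F.avg_pool2d(x.permute(0, 3, 1, 2), self.ws)
        pooled = pooled.flatten(2).transpose(1, 2)  # (B, N_pooled, C)
        kv = (self.l_kv(pooled)
              .reshape(B, -1, 2, self.l_heads, self.l_dim // self.l_heads)
              .permute(2, 0, 3, 1, 4))
        k, v = kv[0], kv[1]
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, H, W, self.l_dim)
        return self.l_proj(out)

    def forward(self, x):  # x: (B, H, W, C)
        # Concatenate the two head groups back along the channel dimension.
        return torch.cat([self.hifi(x), self.lofi(x)], dim=-1)


if __name__ == "__main__":
    x = torch.randn(2, 14, 14, 96)  # (batch, height, width, channels)
    attn = HiLoAttention(dim=96, num_heads=8, window_size=2, alpha=0.5)
    print(attn(x).shape)            # torch.Size([2, 14, 14, 96])
```

Because the low-frequency keys and values are pooled down to one token per window, the Lo-Fi attention matrix shrinks by a factor of `window_size**2` along one dimension, which is what keeps the global path cheap relative to full self-attention.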