Paper Title

Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

Paper Authors

Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, Pengtao Xie

Paper Abstract

Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head self-attention (MHSA) among them. Fully leveraging these image tokens brings redundant computation, since not all tokens are attentive in MHSA. For example, tokens containing semantically meaningless or distracting image backgrounds do not positively contribute to the ViT predictions. In this work, we propose to reorganize image tokens during the feed-forward process of ViT models, which is integrated into ViT during training. For each forward inference, we identify the attentive image tokens between the MHSA and FFN (i.e., feed-forward network) modules, guided by the corresponding class token attention. Then, we reorganize the image tokens by preserving the attentive ones and fusing the inattentive ones to expedite subsequent MHSA and FFN computations. As a result, our method, EViT, improves ViTs from two perspectives. First, under the same number of input image tokens, our method reduces MHSA and FFN computation for efficient inference. For instance, the inference speed of DeiT-S is increased by 50% while its recognition accuracy is decreased by only 0.3% on ImageNet classification. Second, by maintaining the same computational cost, our method empowers ViTs to take more image tokens as input for improved recognition accuracy, where the image tokens come from higher-resolution images. An example is that we improve the recognition accuracy of DeiT-S by 1% on ImageNet classification at the same computational cost as a vanilla DeiT-S. Meanwhile, our method does not introduce more parameters to ViTs. Experiments on standard benchmarks show the effectiveness of our method. The code is available at https://github.com/youweiliang/evit.
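
The abstract describes a token reorganization step: image tokens that receive high class-token attention are kept, while the rest are fused into a single token before the subsequent MHSA and FFN computations. The PyTorch-style sketch below illustrates one way such a step could look; the function name reorganize_tokens, the keep_rate parameter, and the head-averaged cls_attn input are illustrative assumptions rather than the paper's official implementation (see the linked repository for that).

```python
import torch

def reorganize_tokens(x, cls_attn, keep_rate=0.7):
    """Minimal sketch of attention-guided token reorganization (assumed API,
    not the official EViT code).

    x:        (B, N, C) tokens, where x[:, 0] is the class token
    cls_attn: (B, N-1) class-token attention to each image token, assumed to
              be averaged over heads of the preceding MHSA layer
    """
    B, N, C = x.shape
    num_keep = max(1, int(keep_rate * (N - 1)))

    # Indices of the most attentive image tokens, ranked by class-token attention.
    topk_idx = cls_attn.topk(num_keep, dim=1).indices              # (B, K)
    keep_idx = topk_idx.unsqueeze(-1).expand(-1, -1, C)            # (B, K, C)
    attentive = torch.gather(x[:, 1:], dim=1, index=keep_idx)      # (B, K, C)

    # Fuse the remaining (inattentive) tokens into one token, weighted by
    # their class-token attention.
    mask = torch.ones_like(cls_attn, dtype=torch.bool)             # True = inattentive
    mask.scatter_(1, topk_idx, False)
    inatt_attn = cls_attn.masked_fill(~mask, 0.0)                  # zero out kept tokens
    weights = inatt_attn / inatt_attn.sum(dim=1, keepdim=True).clamp_min(1e-6)
    fused = torch.einsum("bn,bnc->bc", weights, x[:, 1:]).unsqueeze(1)

    # New sequence: [class token, attentive tokens, one fused token].
    return torch.cat([x[:, :1], attentive, fused], dim=1)

# Example: 1 image, 197 tokens (1 class + 196 patches), 384-dim embeddings.
x = torch.randn(1, 197, 384)
cls_attn = torch.rand(1, 196)
out = reorganize_tokens(x, cls_attn, keep_rate=0.7)
print(out.shape)  # torch.Size([1, 139, 384]): 1 class + 137 kept + 1 fused
```

As the abstract notes, such a step would sit between the MHSA and FFN modules of selected transformer layers, shrinking the token sequence for all subsequent computation without adding parameters.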
