Paper Title

Visualizing and Understanding Patch Interactions in Vision Transformer

Authors

Jie Ma, Yalong Bai, Bineng Zhong, Wei Zhang, Ting Yao, Tao Mei

Abstract

Vision Transformer (ViT) has become a leading tool in various computer vision tasks, owing to its unique self-attention mechanism that learns visual representations explicitly through cross-patch information interactions. Despite this success, the literature seldom explores the explainability of vision transformers, and there is no clear picture of how the attention mechanism over correlations across patches impacts performance, or what its further potential is. In this work, we propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches in vision transformers. Specifically, we first introduce a quantification indicator to measure the impact of patch interactions and verify this quantification on attention window design and indiscriminative patch removal. Then, we exploit the effective responsive field of each patch in ViT and devise a window-free transformer architecture accordingly. Extensive experiments on ImageNet demonstrate that the designed quantitative method facilitates ViT model learning, improving top-1 accuracy by up to 4.28%. Moreover, results on downstream fine-grained recognition tasks further validate the generalization of our proposal.
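
The abstract refers to cross-patch self-attention and to a quantification indicator for patch interactions, but gives no formulas or code. Below is a minimal sketch, assuming a single-head attention layer with hypothetical projection weights w_q and w_k; the functions patch_attention_map and interaction_impact are illustrative names, not the paper's released implementation, and the relative-attention-mass proxy is an assumption rather than the indicator proposed in the paper.

```python
import torch
import torch.nn.functional as F

def patch_attention_map(patch_embeddings, w_q, w_k):
    """Compute the patch-to-patch attention matrix of one self-attention layer.

    patch_embeddings: (num_patches, dim) tensor of patch tokens.
    w_q, w_k: (dim, dim) query/key projection weights (hypothetical stand-ins
    for a pretrained ViT layer's projections).
    Returns an (num_patches, num_patches) matrix whose entry (i, j) is the
    attention weight patch i assigns to patch j.
    """
    q = patch_embeddings @ w_q            # (N, dim) queries
    k = patch_embeddings @ w_k            # (N, dim) keys
    scale = q.shape[-1] ** -0.5           # standard 1/sqrt(d) scaling
    scores = (q @ k.transpose(-2, -1)) * scale
    return F.softmax(scores, dim=-1)      # row-normalized interaction weights

def interaction_impact(attn, i, j):
    """Naive proxy for how strongly patch j influences patch i: the attention
    mass i places on j, relative to the average mass i places on any patch.
    This is only an illustrative score, not the paper's indicator."""
    return (attn[i, j] / attn[i].mean()).item()

if __name__ == "__main__":
    torch.manual_seed(0)
    n, d = 196, 64                        # e.g. 14x14 patches, toy embedding dim
    x = torch.randn(n, d)                 # random stand-in for patch embeddings
    wq, wk = torch.randn(d, d), torch.randn(d, d)
    attn = patch_attention_map(x, wq, wk)
    print(attn.shape, interaction_impact(attn, 0, 1))
```

In practice one would extract the attention matrices from a pretrained ViT (per layer and per head) and visualize rows of `attn` as heatmaps over the image grid to see which patches interact most strongly.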
