Paper Title

Vision Transformer Visualization: What Neurons Tell and How Neurons Behave?

Paper Authors

Van-Anh Nguyen, Khanh Pham Dinh, Long Tung Vuong, Thanh-Toan Do, Quan Hung Tran, Dinh Phung, Trung Le

Paper Abstract

Recently, vision transformers (ViTs) have been applied successfully to various tasks in computer vision. However, important questions such as why they work or how they behave still remain largely unknown. In this paper, we propose an effective visualization technique to assist in exposing the information carried in neurons and feature embeddings across the ViT's layers. Our approach departs from the computational process of ViTs, focusing on visualizing the local and global information in input images and the latent feature embeddings at multiple levels. Visualizations of the input and the level-0 embeddings reveal interesting findings, for example helping explain why ViTs are generally robust to image occlusion and patch shuffling, and showing that, unlike in CNNs, level-0 embeddings already carry rich semantic details. Next, we develop a rigorous framework to perform effective visualizations across layers, exposing the effects of ViT filters and their grouping/clustering behaviors on object patches. Finally, we provide comprehensive experiments on real datasets to qualitatively and quantitatively demonstrate the merits of our proposed methods as well as our findings. Code: https://github.com/byM1902/ViT_visualization
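To make the idea of layer-wise embedding visualization concrete, below is a minimal sketch (not the authors' method from the repository above) of one common way to inspect ViT feature embeddings: forward hooks capture the token embeddings produced by each transformer block of a pretrained timm ViT, and the patch tokens of one block are projected to three PCA components and displayed as a 14x14 pseudo-RGB map. The model name, layer index, and PCA-based rendering are illustrative assumptions; torch, timm, scikit-learn, and matplotlib are assumed to be installed.

```python
# Sketch: capture per-block ViT token embeddings with forward hooks and
# render one block's patch embeddings as a coarse PCA color map.
import torch
import timm
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

# Collect the token embeddings produced by every transformer block.
layer_outputs = []
def hook(_module, _inputs, output):
    layer_outputs.append(output.detach())

for block in model.blocks:  # timm ViTs expose their transformer blocks as model.blocks
    block.register_forward_hook(hook)

image = torch.randn(1, 3, 224, 224)  # placeholder input; use a real preprocessed image in practice
with torch.no_grad():
    model(image)

# Visualize one block: drop the CLS token, project the 768-dim patch tokens
# onto 3 PCA components, and reshape the 14x14 patch grid into an RGB-like map.
tokens = layer_outputs[5][0, 1:, :]                # (196, 768) patch tokens from block 5
rgb = PCA(n_components=3).fit_transform(tokens.numpy())
rgb = (rgb - rgb.min()) / (rgb.max() - rgb.min())  # normalize to [0, 1] for display
plt.imshow(rgb.reshape(14, 14, 3))
plt.title("PCA of patch embeddings, block 5")
plt.axis("off")
plt.show()
```

Hooking every block rather than only the final layer is what makes cross-layer comparisons possible, e.g. checking how early the embeddings start to group patches belonging to the same object.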
