Paper Title

Vision Transformer Adapter for Dense Predictions

Paper Authors

Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu Qiao

Paper Abstract

This work investigates a simple yet powerful dense prediction task adapter for Vision Transformer (ViT). Unlike recently advanced variants that incorporate vision-specific inductive biases into their architectures, the plain ViT suffers inferior performance on dense predictions due to weak prior assumptions. To address this issue, we propose the ViT-Adapter, which allows plain ViT to achieve comparable performance to vision-specific transformers. Specifically, the backbone in our framework is a plain ViT that can learn powerful representations from large-scale multi-modal data. When transferring to downstream tasks, a pre-training-free adapter is used to introduce the image-related inductive biases into the model, making it suitable for these tasks. We verify ViT-Adapter on multiple dense prediction tasks, including object detection, instance segmentation, and semantic segmentation. Notably, without using extra detection data, our ViT-Adapter-L yields state-of-the-art 60.9 box AP and 53.0 mask AP on COCO test-dev. We hope that the ViT-Adapter could serve as an alternative for vision-specific transformers and facilitate future research. The code and models will be released at https://github.com/czczup/ViT-Adapter.
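The abstract describes a division of labor: a plain, pre-trained ViT serves as the backbone, while a separate pre-training-free adapter injects image-related inductive biases and supplies the multi-scale features that detection and segmentation heads expect. The PyTorch sketch below is only a minimal illustration of that idea under assumed names and shapes; the classes PlainViT and ConvAdapter, the tapped block indices, and the pyramid scales are hypothetical and do not reproduce the released ViT-Adapter implementation (see the repository linked above for the actual code).

# Hypothetical sketch, not the authors' released code: a plain ViT backbone
# paired with a convolutional adapter branch that adds spatial priors and
# reshapes ViT tokens into a multi-scale feature pyramid for dense prediction.
import torch
import torch.nn as nn

class PlainViT(nn.Module):
    """Stand-in for a pre-trained plain ViT: patch embedding + transformer blocks."""
    def __init__(self, dim=768, depth=12, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, C)
        feats = []
        for i, blk in enumerate(self.blocks):
            tokens = blk(tokens)
            if i in (2, 5, 8, 11):           # tap a few intermediate blocks
                feats.append(tokens)
        return feats

class ConvAdapter(nn.Module):
    """Pre-training-free branch: 1x1 convolutions project ViT tokens and
    resampling turns them into a multi-scale pyramid for dense heads."""
    def __init__(self, dim=768, out_dim=256):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(dim, out_dim, 1) for _ in range(4))
        self.scales = (4, 2, 1, 0.5)         # relative to the 1/16 ViT grid

    def forward(self, feats, hw):
        h, w = hw
        pyramid = []
        for f, proj, s in zip(feats, self.proj, self.scales):
            fmap = f.transpose(1, 2).reshape(f.size(0), -1, h, w)  # tokens -> 2D map
            fmap = proj(fmap)
            pyramid.append(nn.functional.interpolate(fmap, scale_factor=s, mode="bilinear"))
        return pyramid

# Usage: the backbone stays plain; the adapter supplies the pyramid a detector needs.
vit, adapter = PlainViT(), ConvAdapter()
img = torch.randn(1, 3, 224, 224)
feats = vit(img)
pyramid = adapter(feats, hw=(224 // 16, 224 // 16))
print([p.shape for p in pyramid])

For a 224x224 input this prints feature maps at roughly 1/4, 1/8, 1/16, and 1/32 resolution, the kind of hierarchy standard dense-prediction heads consume. The released method couples the two branches more tightly, through dedicated interaction modules rather than the simple projections shown here; consult the linked repository for the actual design.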
