Paper Title
Training data-efficient image transformers & distillation through attention
Paper Authors
Paper Abstract
Recently, neural networks based purely on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained on hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce competitive convolution-free transformers by training on ImageNet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the value of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets both on ImageNet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
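The distillation objective the abstract alludes to can be sketched as follows: the student transformer carries a class token and an extra distillation token, and each token's output head is supervised by a different target (the true label for the class token, the teacher's hard prediction for the distillation token). This is a minimal NumPy sketch of that hard-distillation loss; the function names (`hard_distillation_loss`, etc.) and the equal 1/2 weighting are illustrative assumptions, not taken from the paper's released code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, targets):
    """Mean cross-entropy against integer class targets."""
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(targets)), targets] + 1e-12))

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Average of two cross-entropies: the class-token head against the
    true labels, and the distillation-token head against the teacher's
    hard predictions (argmax). Equal weighting is an assumption here."""
    teacher_labels = teacher_logits.argmax(axis=-1)
    return (0.5 * cross_entropy(cls_logits, labels)
            + 0.5 * cross_entropy(dist_logits, teacher_labels))

# In the student, the sequence is [CLS] + [DIST] + patch embeddings;
# all tokens attend to each other, so the distillation token learns
# from the teacher "through attention".
rng = np.random.default_rng(0)
batch, n_classes = 4, 10
cls_logits = rng.normal(size=(batch, n_classes))      # class-token head
dist_logits = rng.normal(size=(batch, n_classes))     # distillation-token head
teacher_logits = rng.normal(size=(batch, n_classes))  # frozen convnet teacher
labels = rng.integers(0, n_classes, size=batch)

loss = hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels)
```

At inference time both heads can be combined (e.g. by averaging their softmax outputs), so the distillation token costs almost nothing extra at test time.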