Paper Title


Training Vision Transformers with Only 2040 Images

Authors

Yun-Hao Cao, Hao Yu, Jianxin Wu

Abstract


Vision Transformers (ViTs) are emerging as an alternative to convolutional neural networks (CNNs) for visual recognition. They achieve results competitive with CNNs, but the lack of the typical convolutional inductive bias makes them more data-hungry than common CNNs. They are often pretrained on JFT-300M or at least on ImageNet, and few works study training ViTs with limited data. In this paper, we investigate how to train ViTs with limited data (e.g., 2040 images). We give theoretical analyses showing that our method (based on parametric instance discrimination) is superior to other methods in that it can capture both feature alignment and instance similarities. We achieve state-of-the-art results when training from scratch on 7 small datasets under various ViT backbones. We also investigate the transferring ability of small datasets and find that representations learned from small datasets can even improve large-scale ImageNet training.
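The abstract mentions parametric instance discrimination as the basis of the method. Below is a minimal sketch of that general idea in PyTorch, not the authors' exact loss: each training image is treated as its own class, and a learnable weight vector per instance is trained with cross-entropy so that augmented views of the same image map to the same instance label. The class name, `instance_head`, and the dimensions in the usage comment are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ParametricInstanceDiscrimination(nn.Module):
    """Minimal sketch of parametric instance discrimination.

    Every training image gets its own class; a learnable weight vector
    per instance (the "parametric" part) is trained with cross-entropy
    against the instance's index, so augmented views of the same image
    are pulled toward the same instance weight.
    """

    def __init__(self, backbone: nn.Module, feat_dim: int, num_instances: int):
        super().__init__()
        self.backbone = backbone  # e.g., a ViT returning (B, feat_dim) features
        self.instance_head = nn.Linear(feat_dim, num_instances, bias=False)
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, images: torch.Tensor, instance_ids: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)                # (B, feat_dim)
        logits = self.instance_head(feats)           # (B, num_instances)
        return self.criterion(logits, instance_ids)  # instance index is the label

# Hypothetical usage with a 2040-image dataset:
# model = ParametricInstanceDiscrimination(vit_backbone, feat_dim=384, num_instances=2040)
# loss = model(augmented_batch, batch_instance_ids)
# loss.backward()
```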
