Paper Title
PatchDropout: Economizing Vision Transformers Using Patch Dropout
Paper Authors
Abstract
Vision transformers have demonstrated the potential to outperform CNNs in a variety of vision tasks. But the computational and memory requirements of these models prohibit their use in many applications, especially those that depend on high-resolution images, such as medical image classification. Efforts to train ViTs more efficiently are overly complicated, necessitating architectural changes or intricate training schemes. In this work, we show that standard ViT models can be efficiently trained at high resolution by randomly dropping input image patches. This simple approach, PatchDropout, reduces FLOPs and memory by at least 50% in standard natural image datasets such as ImageNet, and those savings only increase with image size. On CSAW, a high-resolution medical dataset, we observe a 5 times savings in computation and memory using PatchDropout, along with a boost in performance. For practitioners with a fixed computational or memory budget, PatchDropout makes it possible to choose image resolution, hyperparameters, or model size to get the most performance out of their model.
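The core idea described in the abstract is to randomly keep only a subset of input image patches during training, so the transformer encoder processes a shorter token sequence. Below is a minimal sketch of how such patch dropout could be implemented in PyTorch; the function name, tensor shapes, and the 0.5 keep rate (matching the reported ~50% FLOPs/memory reduction) are illustrative assumptions, not the paper's reference implementation.

```python
# Hypothetical sketch of patch dropout: after an image is split into patch
# tokens (and positional embeddings are added), a random subset of patches is
# kept per sample and only those tokens are passed to the transformer encoder.
import torch


def patch_dropout(tokens: torch.Tensor, keep_rate: float) -> torch.Tensor:
    """Randomly keep a fraction of patch tokens for each sample.

    tokens: (batch, num_patches, dim) patch embeddings, positional embeddings
            already added; a class token, if used, is assumed to be handled
            separately and never dropped.
    """
    batch, num_patches, dim = tokens.shape
    num_keep = max(1, int(num_patches * keep_rate))
    # Independent random ordering of patch indices per sample.
    scores = torch.rand(batch, num_patches, device=tokens.device)
    keep_idx = scores.argsort(dim=1)[:, :num_keep]            # (batch, num_keep)
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, dim)     # (batch, num_keep, dim)
    return tokens.gather(dim=1, index=keep_idx)               # shorter token sequence


if __name__ == "__main__":
    x = torch.randn(2, 196, 768)            # e.g. 224x224 image, 16x16 patches
    kept = patch_dropout(x, keep_rate=0.5)  # keep half of the patches
    print(kept.shape)                       # torch.Size([2, 98, 768])
```

Because self-attention cost grows quadratically with sequence length, halving the number of kept patches reduces attention FLOPs by roughly 4x, which is consistent with the savings growing as image resolution (and thus patch count) increases.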