Paper Title

Towards Efficient and Scalable Sharpness-Aware Minimization

Authors

Yong Liu, Siqi Mai, Xiangning Chen, Cho-Jui Hsieh, Yang You

Abstract

Recently, Sharpness-Aware Minimization (SAM), which connects the geometry of the loss landscape and generalization, has demonstrated significant performance boosts on training large-scale models such as vision transformers. However, the update rule of SAM requires two sequential (non-parallelizable) gradient computations at each step, which can double the computational overhead. In this paper, we propose a novel algorithm LookSAM - that only periodically calculates the inner gradient ascent, to significantly reduce the additional training cost of SAM. The empirical results illustrate that LookSAM achieves similar accuracy gains to SAM while being tremendously faster - it enjoys comparable computational complexity with first-order optimizers such as SGD or Adam. To further evaluate the performance and scalability of LookSAM, we incorporate a layer-wise modification and perform experiments in the large-batch training scenario, which is more prone to converge to sharp local minima. We are the first to successfully scale up the batch size when training Vision Transformers (ViTs). With a 64k batch size, we are able to train ViTs from scratch in minutes while maintaining competitive performance.
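
To make the cost argument in the abstract concrete, below is a minimal toy sketch of the two ideas it describes: a SAM-style step that needs two sequential gradient evaluations (ascend to a perturbed point, then descend using the gradient there), and a LookSAM-style loop that runs that expensive inner ascent only every k steps. This is an illustrative sketch based solely on the abstract, not the authors' implementation: the function names (`sam_like_step`, `looksam_like_training`), the hyperparameter values, and the fallback to a plain gradient step on intermediate iterations are assumptions; the full LookSAM method reuses information from the most recent inner-ascent step on those iterations and adds the layer-wise modification mentioned above, both of which this sketch omits.

```python
import numpy as np

def sam_like_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM-style step: two sequential gradient evaluations per update,
    which is the source of the roughly 2x overhead noted in the abstract."""
    g = grad_fn(w)                                   # 1st gradient, at w
    eps = rho * g / (np.linalg.norm(g) + 1e-12)      # inner ascent perturbation
    g_sharp = grad_fn(w + eps)                       # 2nd gradient, at w + eps
    return w - lr * g_sharp                          # descend with the sharpness-aware gradient

def looksam_like_training(w, grad_fn, steps=100, k=5, lr=0.1, rho=0.05):
    """Simplified LookSAM-style loop: perform the costly inner ascent only
    every k steps; otherwise take an ordinary single-gradient step.
    (The actual LookSAM reuses part of the previous SAM gradient here.)"""
    for t in range(steps):
        if t % k == 0:
            w = sam_like_step(w, grad_fn, lr=lr, rho=rho)   # full two-gradient step
        else:
            w = w - lr * grad_fn(w)                          # one-gradient step
    return w

# Toy usage: minimize f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w0 = np.array([3.0, -2.0])
print(looksam_like_training(w0, grad_fn=lambda w: w))
```

With k = 5 in this toy loop, only one in five iterations pays for the second gradient evaluation, which is the sense in which the abstract claims computational complexity comparable to first-order optimizers such as SGD or Adam.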
