Paper Title

Flatter, faster: scaling momentum for optimal speedup of SGD

Paper Authors

Aditya Cowsik, Tankut Can, Paolo Glorioso

Paper Abstract

Commonly used optimization algorithms often show a trade-off between good generalization and fast training times. For instance, stochastic gradient descent (SGD) tends to have good generalization; however, adaptive gradient methods have superior training times. Momentum can help accelerate training with SGD, but so far there has been no principled way to select the momentum hyperparameter. Here we study training dynamics arising from the interplay between SGD with label noise and momentum in the training of overparametrized neural networks. We find that scaling the momentum hyperparameter $1-\beta$ with the learning rate to the power of $2/3$ maximally accelerates training, without sacrificing generalization. To analytically derive this result we develop an architecture-independent framework, where the main assumption is the existence of a degenerate manifold of global minimizers, as is natural in overparametrized models. Training dynamics display the emergence of two characteristic timescales that are well-separated for generic values of the hyperparameters. The maximum acceleration of training is reached when these two timescales meet, which in turn determines the scaling limit we propose. We confirm our scaling rule for synthetic regression problems (matrix sensing and teacher-student paradigm) and classification for realistic datasets (ResNet-18 on CIFAR10, 6-layer MLP on FashionMNIST), suggesting the robustness of our scaling rule to variations in architectures and datasets.
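The scaling rule in the abstract, $1-\beta \propto \eta^{2/3}$ for learning rate $\eta$, is straightforward to apply in practice. Below is a minimal sketch of how one might set the momentum accordingly; the use of PyTorch's `torch.optim.SGD` and the order-one prefactor `c` are illustrative assumptions on our part (the paper's rule fixes only the $2/3$ exponent, not the constant), not the authors' reference implementation.

```python
import torch
import torch.nn as nn

def momentum_from_lr(lr: float, c: float = 1.0) -> float:
    """Choose beta so that 1 - beta = c * lr**(2/3).

    `c` is a hypothetical order-one prefactor; the proposed scaling rule
    constrains only the 2/3 exponent in the learning rate.
    """
    return 1.0 - c * lr ** (2.0 / 3.0)

# Toy example: a linear model trained with SGD, with momentum tied to the
# learning rate via the 2/3-power scaling.
model = nn.Linear(10, 1)
lr = 1e-3
beta = momentum_from_lr(lr)  # 1 - (1e-3)**(2/3) = 0.99
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=beta)
```

Under this convention, shrinking the learning rate automatically pushes the momentum toward 1 at the rate the paper argues keeps the two characteristic timescales matched.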
