Paper Title
Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent
Paper Authors
Paper Abstract
Stochastic gradient descent (SGD) with constant momentum and its variants such as Adam are the optimization algorithms of choice for training deep neural networks (DNNs). Since DNN training is incredibly computationally expensive, there is great interest in speeding up the convergence. Nesterov accelerated gradient (NAG) improves the convergence rate of gradient descent (GD) for convex optimization using a specially designed momentum; however, it accumulates error when an inexact gradient is used (such as in SGD), slowing convergence at best and diverging at worst. In this paper, we propose Scheduled Restart SGD (SRSGD), a new NAG-style scheme for training DNNs. SRSGD replaces the constant momentum in SGD with the increasing momentum in NAG but stabilizes the iterations by resetting the momentum to zero according to a schedule. Using a variety of models and benchmarks for image classification, we demonstrate that, in training DNNs, SRSGD significantly improves convergence and generalization; for instance, in training ResNet200 for ImageNet classification, SRSGD achieves an error rate of 20.93% vs. the benchmark of 22.13%. These improvements become more significant as the network grows deeper. Furthermore, on both CIFAR and ImageNet, SRSGD reaches similar or even better error rates with significantly fewer training epochs compared to the SGD baseline.
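The scheduled-restart idea described in the abstract can be illustrated with a short sketch. The code below is a minimal, illustrative implementation of Nesterov-style momentum with periodic restarts, not the authors' reference code: it assumes a generic (possibly stochastic) gradient oracle `grad`, a single fixed restart frequency `restart_freq`, and the standard k/(k+3) NAG momentum coefficient; all names and defaults are illustrative.

```python
import numpy as np

def srsgd(grad, x0, lr=0.1, restart_freq=40, n_iters=200):
    """Sketch of scheduled-restart Nesterov momentum (assumed setup).

    grad(x): returns a (possibly stochastic) gradient estimate at x.
    restart_freq: number of iterations between momentum resets (assumed fixed here).
    """
    x = x0.copy()
    v_prev = x0.copy()
    k = 0  # counter driving the NAG momentum; reset to zero on the schedule
    for _ in range(n_iters):
        mu = k / (k + 3.0)          # increasing NAG-style momentum coefficient
        v = x - lr * grad(x)        # gradient step
        x = v + mu * (v - v_prev)   # momentum (look-ahead) step
        v_prev = v
        k += 1
        if k % restart_freq == 0:   # scheduled restart: momentum drops back to zero
            k = 0
    return x

# Illustrative usage on a noisy quadratic, mimicking an inexact (stochastic) gradient.
rng = np.random.default_rng(0)
noisy_grad = lambda x: 2.0 * x + 0.01 * rng.standard_normal(x.shape)
x_opt = srsgd(noisy_grad, x0=np.ones(10))
```

Resetting the counter k zeroes the momentum coefficient at the next step, which is what keeps the accumulated error from the inexact gradients from growing unboundedly, while the increasing momentum between restarts provides the NAG-style acceleration.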