Paper Title
Learning to Accelerate by the Methods of Step-size Planning
Paper Authors
Paper Abstract
Gradient descent is slow to converge for ill-conditioned problems and non-convex problems. An important technique for acceleration is step-size adaptation. The first part of this paper contains a detailed review of step-size adaptation methods, including Polyak step-size, L4, LossGrad, Adam, IDBD, and hypergradient descent, and the relation of step-size adaptation to meta-gradient methods. In the second part of this paper, we propose a new class of methods for accelerating gradient descent that are distinct from existing techniques. The new methods, which we call {\em step-size planning}, use the {\em update experience} to learn an improved way of updating the parameters. The methods organize the experience into updates that are $K$ steps apart from each other, which facilitates planning. From the past experience, our planning algorithm, Csawg, learns a step-size model, which is a form of multi-step machine that predicts future updates. We extend Csawg to apply step-size planning for multiple steps, which leads to further speedup. We discuss and highlight the projection power of the diagonal-matrix step-size for future large-scale applications. We show that, for a convex problem, our methods can surpass the convergence rate of Nesterov's accelerated gradient, $1 - \sqrt{\mu/L}$, where $\mu$ and $L$ are the strong convexity constant of the loss function $F$ and the Lipschitz constant of $F'$; this rate is the theoretical limit for the convergence rate of first-order methods. On the well-known non-convex Rosenbrock function, our planning methods achieve zero error in fewer than 500 gradient evaluations, while gradient descent takes about 10,000 gradient evaluations to reach an accuracy of $10^{-3}$. We discuss the connection of step-size planning to planning in reinforcement learning, in particular the Dyna architecture. (This abstract is shorter than the one in the paper because of the length requirement.)
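As a point of reference for the Rosenbrock experiment mentioned in the abstract, below is a minimal sketch of the plain gradient-descent baseline on the two-dimensional Rosenbrock function. The starting point, step size, and iteration count are illustrative choices and are not taken from the paper; this is not the authors' Csawg method, only the slow baseline it is compared against.

```python
import numpy as np

def rosenbrock(x):
    """Rosenbrock function: f(x, y) = (1 - x)^2 + 100 (y - x^2)^2, minimum at (1, 1)."""
    return (1.0 - x[0]) ** 2 + 100.0 * (x[1] - x[0] ** 2) ** 2

def rosenbrock_grad(x):
    """Analytic gradient of the Rosenbrock function."""
    dx = -2.0 * (1.0 - x[0]) - 400.0 * x[0] * (x[1] - x[0] ** 2)
    dy = 200.0 * (x[1] - x[0] ** 2)
    return np.array([dx, dy])

# Plain gradient descent with a fixed step size (illustrative settings, not the paper's).
x = np.array([-1.2, 1.0])   # a commonly used starting point for this benchmark
alpha = 1e-3                # fixed step size; must be small to stay stable in the narrow valley
for t in range(10000):
    x = x - alpha * rosenbrock_grad(x)

print("final loss:", rosenbrock(x))  # progress along the curved valley is slow
```

The slow progress of this fixed-step baseline is what motivates step-size adaptation and, in the paper, learning a step-size model from the update experience.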