Title
Trajectory of Mini-Batch Momentum: Batch Size Saturation and Convergence in High Dimensions
Authors
Abstract
We analyze the dynamics of large-batch stochastic gradient descent with momentum (SGD+M) on the least squares problem when both the number of samples and the dimension are large. In this setting, we show that the dynamics of SGD+M converge to a deterministic discrete Volterra equation as the dimension increases, which we analyze. We identify a stability measure, the implicit conditioning ratio (ICR), which regulates the ability of SGD+M to accelerate the algorithm. When the batch size exceeds this ICR, SGD+M converges linearly at a rate of $\mathcal{O}(1/\sqrt{\kappa})$, matching optimal full-batch momentum (in particular, performing as well as full-batch momentum while using only a fraction of the batch size). For batch sizes smaller than the ICR, in contrast, SGD+M has rates that scale like a multiple of the single-batch SGD rate. We give explicit choices for the learning rate and momentum parameter, in terms of the Hessian spectrum, that achieve this performance.