Paper Title
Learning Rates as a Function of Batch Size: A Random Matrix Theory Approach to Neural Network Training
Paper Authors
Paper Abstract
We study the effect of mini-batching on the loss landscape of deep neural networks using spiked, field-dependent random matrix theory. We demonstrate that the extremal values of the batch Hessian are larger in magnitude than those of the empirical Hessian. We also derive similar results for the Generalised Gauss-Newton matrix approximation of the Hessian. As a consequence of our theorems, we derive analytical expressions for the maximal learning rates as a function of batch size, informing practical training regimens for both stochastic gradient descent (linear scaling) and adaptive algorithms, such as Adam (square root scaling), for smooth, non-convex deep neural networks. Whilst the linear scaling for stochastic gradient descent has been derived under more restrictive conditions, which we generalise, the square root scaling rule for adaptive optimisers is, to our knowledge, completely novel. For stochastic second-order methods and adaptive methods, we derive that the minimal damping coefficient is proportional to the ratio of the learning rate to batch size. We validate our claims on the VGG/WideResNet architectures on the CIFAR-$100$ and ImageNet datasets. Based on our investigations of the sub-sampled Hessian, we develop a stochastic Lanczos quadrature based on-the-fly learning rate and momentum learner, which avoids the need for expensive multiple evaluations of these key hyper-parameters and shows good preliminary results on the Pre-Residual architecture for CIFAR-$100$.
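As a rough illustration of the scaling rules summarised in the abstract, the following sketch rescales a learning rate tuned at one batch size to a new batch size: linearly for SGD, with the square root for Adam, and keeping the damping coefficient proportional to the ratio of learning rate to batch size. The function and all argument names are hypothetical conveniences of ours, encoding only the stated proportionalities rather than the paper's actual procedure.

```python
# Hypothetical helper encoding the abstract's scaling rules; not the paper's code.

def rescale_hyperparameters(base_lr, base_batch, new_batch,
                            optimiser="sgd", base_damping=None):
    """Rescale the learning rate (and optionally a damping coefficient)
    when moving from `base_batch` to `new_batch`."""
    ratio = new_batch / base_batch
    if optimiser == "sgd":
        new_lr = base_lr * ratio            # linear scaling for SGD
    elif optimiser == "adam":
        new_lr = base_lr * ratio ** 0.5     # square-root scaling for Adam
    else:
        raise ValueError(f"unknown optimiser: {optimiser}")

    if base_damping is None:
        return new_lr
    # Keep the damping coefficient proportional to (learning rate / batch size).
    new_damping = base_damping * (new_lr / new_batch) / (base_lr / base_batch)
    return new_lr, new_damping


# Example: hyper-parameters tuned at batch size 128, scaled up to 1024.
print(rescale_hyperparameters(0.1, 128, 1024, optimiser="sgd"))     # 0.8
print(rescale_hyperparameters(0.001, 128, 1024, optimiser="adam"))  # ~0.00283
```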