Paper Title
Study on the Large Batch Size Training of Neural Networks Based on the Second Order Gradient
Paper Authors
Paper Abstract
Large batch size training in deep neural networks (DNNs) suffers from a well-known 'generalization gap' that markedly degrades generalization performance. However, it remains unclear how varying the batch size affects the structure of an NN. Here, we combine theory with experiments to explore the evolution of basic structural properties, including the gradient, the parameter update step length, and the loss update step length of NNs under varying batch sizes. We provide new guidance for improving generalization, which is further verified by two designed methods: discarding small-loss samples and scheduling the batch size. A curvature-based learning rate (CBLR) algorithm is proposed to better fit the curvature variation, a sensitive factor affecting large batch size training, across the layers of an NN. As an approximation of CBLR, the median-curvature LR (MCLR) algorithm is found to achieve performance comparable to the Layer-wise Adaptive Rate Scaling (LARS) algorithm. Our theoretical results and algorithm offer geometry-based explanations of existing studies. Furthermore, we demonstrate that layer-wise LR algorithms, such as LARS, can be regarded as special instances of CBLR. Finally, we deduce a theoretical geometric picture of large batch size training and show that all the network parameters tend to center on their related minima.
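To make the layer-wise LR idea mentioned in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' code): a LARS-style per-layer learning rate, which the paper describes as a special instance of CBLR, plus a median-based variant in the spirit of MCLR. The curvature proxy (ratio of gradient norm to parameter norm per layer), the function names, and all hyperparameter values are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of layer-wise LR scaling; the curvature proxy and
# constants are assumptions for illustration only.
import numpy as np

def lars_style_lr(weights, grads, base_lr=0.1, trust_coef=0.001, eps=1e-8):
    """Per-layer LR scaled by ||w|| / ||g||, as in LARS-style trust ratios."""
    lrs = {}
    for name in weights:
        w_norm = np.linalg.norm(weights[name])
        g_norm = np.linalg.norm(grads[name])
        lrs[name] = base_lr * trust_coef * w_norm / (g_norm + eps)
    return lrs

def median_style_lr(weights, grads, base_lr=0.1, eps=1e-8):
    """MCLR-flavoured variant: one LR from the median per-layer ratio."""
    ratios = [np.linalg.norm(grads[n]) / (np.linalg.norm(weights[n]) + eps)
              for n in weights]
    return base_lr / (np.median(ratios) + eps)

# Toy usage with two random "layers".
rng = np.random.default_rng(0)
W = {"fc1": rng.normal(size=(64, 32)), "fc2": rng.normal(size=(32, 10))}
G = {k: 0.01 * rng.normal(size=v.shape) for k, v in W.items()}
print(lars_style_lr(W, G))
print(median_style_lr(W, G))
```

The design choice illustrated here is that each layer (or the network as a whole) gets a step size normalized by a cheap per-layer statistic, which is the geometric intuition the abstract attributes to CBLR.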