Paper Title
A Random Matrix Theory Approach to Damping in Deep Learning
Paper Authors
Paper Abstract
We conjecture that the inherent difference in generalisation between adaptive and non-adaptive gradient methods in deep learning stems from increased estimation noise in the flattest directions of the true loss surface. We demonstrate that typical schedules used for adaptive methods (with low numerical stability or damping constants) serve to bias movement towards flat directions relative to sharp directions, effectively amplifying the noise-to-signal ratio and harming generalisation. We further demonstrate that the numerical damping constant used in these methods can be decomposed into a learning rate reduction and a linear shrinkage of the estimated curvature matrix. We then demonstrate significant generalisation improvements by increasing the shrinkage coefficient, closing the generalisation gap entirely in both logistic regression and several deep neural network experiments. Extending this line further, we develop a novel random matrix theory based damping learner for second-order optimisers, inspired by linear shrinkage estimation. We experimentally demonstrate that our learner is very insensitive to its initialised value and allows for extremely fast convergence in conjunction with continued stable training and competitive generalisation.
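As a brief illustrative sketch of the decomposition mentioned in the abstract (the notation here is assumed, not taken from the paper: B denotes the estimated curvature matrix, \delta the numerical damping constant, and \nabla L the gradient): since B + \delta I = (1+\delta)\,[\lambda B + (1-\lambda) I] with \lambda = 1/(1+\delta), the damped second-order update satisfies

\[
(B + \delta I)^{-1}\nabla L \;=\; \frac{1}{1+\delta}\,\bigl[\lambda B + (1-\lambda) I\bigr]^{-1}\nabla L,
\qquad \lambda = \frac{1}{1+\delta},
\]

so increasing \delta simultaneously scales the step down by a factor of 1/(1+\delta) (a learning rate reduction) and shrinks the estimated curvature linearly towards the identity with shrinkage coefficient 1-\lambda = \delta/(1+\delta).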