Paper Title

Implicit bias of deep linear networks in the large learning rate phase

Paper Authors

Wei Huang, Weitao Du, Richard Yi Da Xu, Chunrui Liu

Paper Abstract

Most theoretical studies explaining the regularization effect in deep learning have focused only on gradient descent with a sufficiently small learning rate, or even gradient flow (infinitesimal learning rate). Such studies, however, have neglected the reasonably large learning rates applied in most practical applications. In this work, we characterize the implicit bias effect of deep linear networks for binary classification using the logistic loss in the large learning rate regime, inspired by the seminal work of Lewkowycz et al. [26] in a regression setting with squared loss. They found a learning rate regime with a large step size, named the catapult phase, where the loss grows at the early stage of training and eventually converges to a minimum that is flatter than those found in the small learning rate regime. We claim that, depending on the separation conditions of the data, the gradient descent iterates converge to a flatter minimum in the catapult phase. We rigorously prove this claim under the assumption of degenerate data by overcoming the difficulty of the non-constant Hessian of the logistic loss, and further characterize the behavior of the loss and Hessian for non-separable data. Finally, we demonstrate empirically that flatter minima in the space spanned by non-separable data, together with a learning rate in the catapult phase, can lead to better generalization.
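
To make the setting concrete, the following is a minimal, illustrative sketch (not the paper's experimental code) of full-batch gradient descent on a deep linear network trained with the logistic loss, run once with a small and once with a larger step size. The synthetic Gaussian data, network depth and width, and the two learning rates are assumptions chosen purely for illustration and may need tuning to exhibit the early loss growth associated with the catapult phase.

```python
import numpy as np

# Synthetic, non-separable binary data: two overlapping Gaussians.
# (Illustrative only; not the data model analyzed in the paper.)
data_rng = np.random.default_rng(0)
n, d = 200, 5
X = np.vstack([data_rng.normal(+0.5, 1.0, size=(n // 2, d)),
               data_rng.normal(-0.5, 1.0, size=(n // 2, d))])
y = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])


def train(lr, depth=3, width=10, steps=300, seed=1):
    """Full-batch gradient descent on a deep linear network with logistic loss."""
    rng = np.random.default_rng(seed)  # same initialization across runs for comparison
    dims = [d] + [width] * (depth - 1) + [1]
    Ws = [rng.normal(0.0, 1.0 / np.sqrt(dims[i]), size=(dims[i], dims[i + 1]))
          for i in range(depth)]
    losses = []
    for _ in range(steps):
        # Forward pass: f(x) = x W_1 W_2 ... W_L (no nonlinearities).
        acts = [X]
        for W in Ws:
            acts.append(acts[-1] @ W)
        f = acts[-1].ravel()
        losses.append(np.mean(np.logaddexp(0.0, -y * f)))  # logistic loss

        # Backward pass (chain rule through the matrix product).
        sig_neg = 0.5 * (1.0 - np.tanh(0.5 * y * f))  # sigmoid(-y f), numerically stable
        grad_out = (-(y * sig_neg) / n)[:, None]      # dL/df
        grads = [None] * depth
        for i in reversed(range(depth)):
            grads[i] = acts[i].T @ grad_out           # dL/dW_i
            grad_out = grad_out @ Ws[i].T             # dL/d(activation i)
        Ws = [W - lr * g for W, g in zip(Ws, grads)]
    return losses


small = train(lr=0.1)   # "small learning rate" run
large = train(lr=5.0)   # larger step size: the loss may rise early before settling
print("small lr: loss[0]=%.4f  loss[-1]=%.4f" % (small[0], small[-1]))
print("large lr: loss[0]=%.4f  loss[-1]=%.4f" % (large[0], large[-1]))
```

Comparing the two loss trajectories gives a qualitative picture of the phenomenon described above; the paper's analysis concerns the exact gradient descent dynamics and their Hessian, not this toy simulation.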
