Paper Title

Disentangling the Mechanisms Behind Implicit Regularization in SGD

Authors

Zachary Novack, Simran Kaur, Tanya Marwah, Saurabh Garg, Zachary C. Lipton

Abstract

A number of competing hypotheses have been proposed to explain why small-batch Stochastic Gradient Descent (SGD) leads to improved generalization over the full-batch regime, with recent work crediting the implicit regularization of various quantities throughout training. However, to date, empirical evidence assessing the explanatory power of these hypotheses is lacking. In this paper, we conduct an extensive empirical evaluation, focusing on the ability of various theorized mechanisms to close the small-to-large batch generalization gap. Additionally, we characterize how the quantities that SGD has been claimed to (implicitly) regularize change over the course of training. By using micro-batches, i.e., disjoint smaller subsets of each mini-batch, we empirically show that explicitly penalizing the gradient norm or the Fisher Information Matrix trace, averaged over micro-batches, in the large-batch regime recovers small-batch SGD generalization, whereas Jacobian-based regularizations fail to do so. This generalization performance is shown to often be correlated with how well the regularized model's gradient norms resemble those of small-batch SGD. We additionally show that this behavior breaks down as the micro-batch size approaches the batch size. Finally, we note that in this line of inquiry, positive experimental findings on CIFAR10 are often reversed on other datasets like CIFAR100, highlighting the need to test hypotheses on a wider collection of datasets.
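The central objective described in the abstract — the training loss plus an explicit penalty on the gradient norm averaged over disjoint micro-batches — can be illustrated with a minimal numpy sketch. Here a linear least-squares model stands in for the network so that per-micro-batch gradients are analytic; the model choice, function name, and squared-norm form of the penalty are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def microbatch_grad_norm_objective(theta, X, y, micro_size, lam):
    """Full-batch MSE loss plus lam times the average squared gradient
    norm over disjoint micro-batches (a sketch of the penalty the paper
    studies; the linear model here is an illustrative assumption)."""
    n = len(y)
    resid = X @ theta - y
    loss = 0.5 * np.mean(resid ** 2)

    # Average the squared gradient norm over disjoint micro-batches,
    # i.e. non-overlapping slices of the large batch.
    n_micro = n // micro_size
    penalty = 0.0
    for m in range(n_micro):
        sl = slice(m * micro_size, (m + 1) * micro_size)
        g_m = X[sl].T @ (X[sl] @ theta - y[sl]) / micro_size
        penalty += np.sum(g_m ** 2)
    penalty /= n_micro

    return loss + lam * penalty
```

With `lam = 0` this reduces to the plain large-batch loss; increasing `lam` biases optimization toward parameters whose micro-batch gradients are small, mimicking the implicit pressure attributed to small-batch SGD. In practice (e.g., with a deep network) the penalty's gradient would be obtained via automatic differentiation rather than analytically.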
