Paper Title

Benign Underfitting of Stochastic Gradient Descent

Authors

Tomer Koren, Roi Livni, Yishay Mansour, Uri Sherman

Abstract

We study to what extent may stochastic gradient descent (SGD) be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to training data. We consider the fundamental stochastic convex optimization framework, where (one-pass, without-replacement) SGD is classically known to minimize the population risk at rate $O(1/\sqrt n)$, and prove that, surprisingly, there exist problem instances where the SGD solution exhibits both empirical risk and generalization gap of $\Omega(1)$. Consequently, it turns out that SGD is not algorithmically stable in any sense, and its generalization ability cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis). We then continue to analyze the closely related with-replacement SGD, for which we show that an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate. Finally, we interpret our main results in the context of without-replacement SGD for finite-sum convex optimization problems, and derive upper and lower bounds for the multi-epoch regime that significantly improve upon previously known results.
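To make the two algorithms contrasted in the abstract concrete, here is a minimal sketch, not taken from the paper, of one-pass, without-replacement SGD versus with-replacement SGD. The function names, the generic stochastic gradient oracle `grad`, and the toy least-squares objective in the usage example are all illustrative assumptions; both variants return the averaged iterate, which is the output for which the classical $O(1/\sqrt n)$ population-risk bound is stated.

```python
# Minimal sketch of the two SGD variants discussed in the abstract.
# All names and the toy objective below are illustrative assumptions,
# not the paper's constructions.
import numpy as np

def one_pass_without_replacement_sgd(grad, samples, x0, eta):
    """One-pass SGD: each training example z_1, ..., z_n is used exactly
    once, in order, and the averaged iterate is returned."""
    x = x0.copy()
    iterates = [x.copy()]
    for z in samples:                  # single pass, no sample is reused
        x = x - eta * grad(x, z)
        iterates.append(x.copy())
    return np.mean(iterates, axis=0)   # averaged iterate

def with_replacement_sgd(grad, samples, x0, eta, num_steps, rng):
    """With-replacement SGD: at every step an example is drawn uniformly
    at random from the training set, so samples may repeat."""
    n = len(samples)
    x = x0.copy()
    iterates = [x.copy()]
    for _ in range(num_steps):
        z = samples[rng.integers(n)]   # resample uniformly from the data
        x = x - eta * grad(x, z)
        iterates.append(x.copy())
    return np.mean(iterates, axis=0)

# Toy usage: least-squares loss f(x; (a, b)) = 0.5 * (a @ x - b)^2.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 5, 200
    A = rng.normal(size=(n, d))
    b = A @ np.ones(d) + 0.1 * rng.normal(size=n)
    data = list(zip(A, b))
    grad = lambda x, z: (z[0] @ x - z[1]) * z[0]
    x0 = np.zeros(d)
    eta = 1.0 / np.sqrt(n)             # standard step size for the O(1/sqrt(n)) rate
    print(one_pass_without_replacement_sgd(grad, data, x0, eta))
    print(with_replacement_sgd(grad, data, x0, eta, num_steps=n, rng=rng))
```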
