Paper Title

Optimized convergence of stochastic gradient descent by weighted averaging

Paper Authors

Melinda Hagedorn, Florian Jarre

Paper Abstract

Under mild assumptions, stochastic gradient methods asymptotically achieve an optimal rate of convergence if the arithmetic mean of all iterates is returned as the approximate optimal solution. However, in the absence of stochastic noise, the arithmetic mean of all iterates converges considerably more slowly to the optimal solution than the iterates themselves. Also in the presence of noise, when finite termination of the stochastic gradient method is considered, the arithmetic mean is not necessarily the best possible approximation of the unknown optimal solution. This paper aims at identifying optimal strategies in a particularly simple case: the minimization of a strongly convex function with i.i.d. noise terms and finite termination. Explicit formulas for the stochastic error and the optimization error are derived as functions of certain parameters of the SGD method. The aim was to choose these parameters such that both the stochastic error and the optimization error are reduced compared to arithmetic averaging. This aim could not be achieved; however, allowing a slight increase of the stochastic error made it possible to select the parameters such that the optimization error is reduced significantly. This reduction of the optimization error has a strong effect on the approximate solution generated by the stochastic gradient method when only a moderate number of iterations is used or when the initial error is large. Numerical examples confirm the theoretical results and suggest that a generalization to non-quadratic objective functions may be possible.
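
The abstract does not reproduce the paper's parameter formulas. As an illustration only, the following minimal Python sketch runs SGD on a one-dimensional strongly convex quadratic with i.i.d. Gaussian gradient noise and finite termination, and compares the mean squared error of the final iterate, the arithmetic mean of all iterates, and a weighted average. The tail-heavy weight scheme w_k ∝ k, the shifted step size, and all numerical settings are assumptions made for this sketch, not the optimized choices derived in the paper.

```python
import numpy as np

# Illustration only: SGD on the strongly convex quadratic f(x) = mu/2 * x^2
# (optimum x* = 0) with i.i.d. Gaussian gradient noise and finite termination
# after N steps. The weights w_k proportional to k are a hypothetical choice,
# not the paper's optimized parameters.

rng = np.random.default_rng(0)

mu, sigma = 1.0, 1.0   # strong convexity parameter, noise std. deviation
x0, N = 50.0, 200      # large initial error, moderate iteration count
trials = 2000          # Monte Carlo repetitions to estimate the MSE

mse = {"last iterate": [], "arithmetic mean": [], "weights w_k ~ k": []}
w = np.arange(1.0, N + 1.0)                      # tail-heavy weights
for _ in range(trials):
    x, iterates = x0, np.empty(N)
    for k in range(N):
        g = mu * x + sigma * rng.standard_normal()   # noisy gradient of f
        x -= g / (mu * (k + 10))                     # decreasing step size
        iterates[k] = x
    mse["last iterate"].append(iterates[-1] ** 2)
    mse["arithmetic mean"].append(iterates.mean() ** 2)
    mse["weights w_k ~ k"].append((w @ iterates / w.sum()) ** 2)

for name, errs in mse.items():
    print(f"MSE, {name:16s}: {np.mean(errs):.4f}")
```

In this large-initial-error, moderate-iteration regime the arithmetic mean is dominated by the early, inaccurate iterates, and the tail-heavy weighting reduces that optimization error substantially, consistent with the qualitative claims of the abstract.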
