Nest Your Adaptive Algorithm for Parameter-Agnostic Nonconvex Minimax Optimization

Paper title

Nest Your Adaptive Algorithm for Parameter-Agnostic Nonconvex Minimax Optimization

Authors

Junchi Yang, Xiang Li, Niao He

Abstract

Adaptive algorithms like AdaGrad and AMSGrad are successful in nonconvex optimization owing to their parameter-agnostic ability: they require neither a priori knowledge of problem-specific parameters nor tuning of learning rates. However, when it comes to nonconvex minimax optimization, direct extensions of such adaptive optimizers without proper time-scale separation may fail to work in practice. We provide an example proving that the simple combination of Gradient Descent Ascent (GDA) with adaptive stepsizes can diverge if the primal-dual stepsize ratio is not carefully chosen; hence, a fortiori, such adaptive extensions are not parameter-agnostic. To address the issue, we formally introduce a Nested Adaptive framework, NeAda for short, that carries an inner loop for adaptively maximizing the dual variable with controllable stopping criteria and an outer loop for adaptively minimizing the primal variable. Such a mechanism can be equipped with off-the-shelf adaptive optimizers and automatically balances the progress in the primal and dual variables. Theoretically, for nonconvex-strongly-concave minimax problems, we show that NeAda can achieve the near-optimal $\tilde{O}(\epsilon^{-2})$ and $\tilde{O}(\epsilon^{-4})$ gradient complexities in the deterministic and stochastic settings, respectively, without prior information on the problem's smoothness and strong concavity parameters. To the best of our knowledge, this is the first algorithm that simultaneously achieves near-optimal convergence rates and parameter-agnostic adaptation in the nonconvex minimax setting. Numerically, we further illustrate the robustness of the NeAda family with experiments on simple test functions and a real-world application.
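
The nested structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes scalar AdaGrad-style stepsizes in both loops, a toy nonconvex-strongly-concave objective $f(x, y) = \sin(x)\,y - y^2/2$ chosen only for illustration, and an inner stopping threshold of $1/(t+1)$ on the squared dual gradient. The function name `neada_adagrad` and all of its parameters are hypothetical.

```python
import math


def grad_x(x, y):
    # Partial derivative of the toy objective f(x, y) = sin(x) * y - y**2 / 2 in x.
    return math.cos(x) * y


def grad_y(x, y):
    # Partial derivative of the same toy objective in y (strongly concave in y).
    return math.sin(x) - y


def neada_adagrad(x0, y0, outer_steps=500, eta_x=0.5, eta_y=0.5, max_inner=10_000):
    """Nested adaptive descent-ascent sketch (illustrative assumptions throughout)."""
    x, y = x0, y0
    vx = 0.0  # running sum of squared primal gradients (AdaGrad accumulator)
    vy = 0.0  # running sum of squared dual gradients, kept across outer steps (assumption)
    for t in range(outer_steps):
        tol = 1.0 / (t + 1)  # assumed stopping threshold, tightened every outer step
        # Inner loop: adaptive ascent on y until the squared dual gradient is small.
        for _ in range(max_inner):
            gy = grad_y(x, y)
            if gy * gy <= tol:
                break
            vy += gy * gy
            y += eta_y / math.sqrt(vy) * gy
        # Outer loop: one adaptive descent step on x using the current y.
        gx = grad_x(x, y)
        vx += gx * gx
        if vx > 0.0:  # skip the update while no gradient has been accumulated
            x -= eta_x / math.sqrt(vx) * gx
    return x, y


if __name__ == "__main__":
    x, y = neada_adagrad(1.0, 0.0)
    print(f"x = {x:.4f}, y = {y:.4f}")
```

For this toy objective the inner maximizer is $y^*(x) = \sin(x)$, so the primal function is $\Phi(x) = \sin^2(x)/2$, and a run started at `x0 = 1.0` should drift toward the stationary point $x = 0$. Neither stepsize is set using the smoothness or strong-concavity constants. The balance between the two variables comes only from the inner stopping rule, which is the point the sketch is meant to show.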

Paper title