Paper Title


Gradient Descent-Ascent Provably Converges to Strict Local Minmax Equilibria with a Finite Timescale Separation

Authors

Tanner Fiez, Lillian Ratliff

Abstract


We study the role that a finite timescale separation parameter $τ$ has on gradient descent-ascent in two-player non-convex, non-concave zero-sum games, where the learning rate of player 1 is denoted by $γ_1$ and the learning rate of player 2 is defined to be $γ_2 = τγ_1$. Existing work analyzing the role of timescale separation in gradient descent-ascent has primarily focused on the edge cases of players sharing a learning rate ($τ = 1$) and the maximizing player approximately converging between each update of the minimizing player ($τ \rightarrow \infty$). For the parameter choice of $τ = 1$, it is known that the learning dynamics are not guaranteed to converge to a game-theoretically meaningful equilibrium in general. In contrast, Jin et al. (2020) showed that the stable critical points of gradient descent-ascent coincide with the set of strict local minmax equilibria as $τ \rightarrow \infty$. In this work, we bridge the gap between these past works by showing that there exists a finite timescale separation parameter $τ^{\ast}$ such that $x^{\ast}$ is a stable critical point of gradient descent-ascent for all $τ \in (τ^{\ast}, \infty)$ if and only if it is a strict local minmax equilibrium. Moreover, we provide an explicit construction for computing $τ^{\ast}$, along with corresponding convergence rates and results under deterministic and stochastic gradient feedback. The convergence results we present are complemented by a non-convergence result: given a critical point $x^{\ast}$ that is not a strict local minmax equilibrium, there exists a finite timescale separation $τ_0$ such that $x^{\ast}$ is unstable for all $τ \in (τ_0, \infty)$. Finally, we empirically demonstrate on the CIFAR-10 and CelebA datasets the significant impact timescale separation has on training performance.
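As a minimal illustration (not the authors' code), the $τ$-GDA dynamics described above can be sketched on the hypothetical quadratic zero-sum game $f(x, y) = -x^2/2 + 2xy - y^2/2$: the origin is a strict local minmax equilibrium ($\nabla^2_{yy} f = -1 < 0$ and the Schur complement $-1 + 2^2/1 = 3 > 0$), and a linear stability analysis of the continuous-time limit gives a threshold of $τ^{\ast} = 1$ for this game, so the same iteration converges for $τ$ above the threshold and diverges below it.

```python
import numpy as np

def tau_gda(tau, gamma=0.01, steps=5000, x0=0.5, y0=0.5):
    """Simultaneous gradient descent-ascent with learning rates
    (gamma, tau * gamma) on f(x, y) = -x^2/2 + 2xy - y^2/2.

    Returns the final distance to the equilibrium at the origin.
    """
    x, y = x0, y0
    for _ in range(steps):
        gx = -x + 2 * y   # df/dx: player 1 descends
        gy = 2 * x - y    # df/dy: player 2 ascends at rate tau * gamma
        x, y = x - gamma * gx, y + tau * gamma * gy
    return np.hypot(x, y)

print(tau_gda(tau=4.0))   # tau above the threshold: iterates approach (0, 0)
print(tau_gda(tau=0.5))   # tau below the threshold: iterates spiral outward
```

The game is concave in $x$, so the origin is not a stable point of the individual gradient flows; it is the timescale separation $τ > τ^{\ast}$ that stabilizes the coupled dynamics, mirroring the paper's if-and-only-if characterization.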
