Paper Title
Generalized Bandit Regret Minimizer Framework in Imperfect Information Extensive-Form Game
Paper Authors
Paper Abstract
Regret minimization methods are a powerful tool for learning an approximate Nash equilibrium (NE) in two-player zero-sum imperfect-information extensive-form games (IIEGs). We consider the problem in the interactive bandit-feedback setting, where the dynamics of the IIEG are unknown: in general, only the interactive trajectory and the value $v(z^t)$ of the reached terminal node are revealed. To learn an NE, the regret minimizer must estimate the full-feedback loss gradient $\ell^t$ from $v(z^t)$ and minimize the regret. In this paper, we propose a generalized framework for this learning setting. It provides a theoretical foundation for the design and modular analysis of bandit regret minimization methods, and we demonstrate that the most recent bandit regret minimization methods can be analyzed as particular cases of our framework. Following this framework, we describe a novel method, SIX-OMD, for learning an approximate NE. It is model-free and substantially improves the best existing convergence rate, from the order of $O(\sqrt{XB/T}+\sqrt{YC/T})$ to $O(\sqrt{M_{\mathcal{X}}/T}+\sqrt{M_{\mathcal{Y}}/T})$. Moreover, SIX-OMD is computationally efficient, as it needs to update the current strategy and the average strategy only along the sampled trajectory.
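To make the estimation step concrete, a common construction in the bandit-feedback literature (given here only as an illustrative sketch, with notation introduced for this purpose, and not necessarily the exact estimator used by SIX-OMD) is an importance-weighted loss estimator with implicit exploration:
$$\hat{\ell}^t(i,a) \;=\; \frac{\mathbb{1}\{(i,a)\ \text{lies on the path to}\ z^t\}\, v(z^t)}{x^t(i,a) + \gamma},$$
where $x^t(i,a)$ denotes the learner's sequence-form probability of reaching sequence $(i,a)$ under its current strategy and $\gamma \geq 0$ is a small bias term that keeps the estimate bounded. Such an estimate is zero outside the sampled trajectory, which is what allows the current-strategy and average-strategy updates to be restricted to that trajectory.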