Paper Title
Generalized Bandit Regret Minimizer Framework in Imperfect Information Extensive-Form Game
Paper Authors
Paper Abstract
Regret minimization methods are a powerful tool for learning an approximate Nash equilibrium (NE) in two-player zero-sum imperfect-information extensive-form games (IIEGs). We consider the problem in the interactive bandit-feedback setting, where the dynamics of the IIEG are unknown: in general, only the interactive trajectory and the value $v(z^t)$ of the reached terminal node are revealed. To learn an NE, the regret minimizer must estimate the full-feedback loss gradient $\ell^t$ from $v(z^t)$ and minimize the regret. In this paper, we propose a generalized framework for this learning setting. It provides a theoretical foundation for the design and modular analysis of bandit regret minimization methods, and we demonstrate that the most recent bandit regret minimization methods can be analyzed as particular cases of our framework. Following this framework, we describe a novel method, SIX-OMD, for learning an approximate NE. It is model-free and substantially improves the best existing convergence rate, from the order of $O(\sqrt{XB/T}+\sqrt{YC/T})$ to $O(\sqrt{M_{\mathcal{X}}/T}+\sqrt{M_{\mathcal{Y}}/T})$. Moreover, SIX-OMD is computationally efficient, as it needs to update the current strategy and the average strategy only along the sampled trajectory.
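To make the estimation step concrete, a common construction in the bandit-feedback literature (given here only as an illustrative sketch, with notation introduced for this purpose, and not necessarily the exact estimator used by SIX-OMD) is an importance-weighted loss estimator with implicit exploration:
$$\hat{\ell}^t(i,a) \;=\; \frac{\mathbb{1}\{(i,a)\ \text{lies on the path to}\ z^t\}\, v(z^t)}{x^t(i,a) + \gamma},$$
where $x^t(i,a)$ denotes the learner's sequence-form probability of reaching sequence $(i,a)$ under its current strategy and $\gamma \geq 0$ is a small bias term that keeps the estimate bounded. Such an estimate is zero outside the sampled trajectory, which is what allows the current-strategy and average-strategy updates to be restricted to that trajectory.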