Paper Title
Efficient Phi-Regret Minimization in Extensive-Form Games via Online Mirror Descent
Paper Authors
Paper Abstract
A conceptually appealing approach for learning Extensive-Form Games (EFGs) is to convert them to Normal-Form Games (NFGs). This approach enables us to directly translate state-of-the-art techniques and analyses in NFGs to learning EFGs, but typically suffers from computational intractability due to the exponential blow-up of the game size introduced by the conversion. In this paper, we address this problem in natural and important setups for the \emph{$Φ$-Hedge} algorithm -- a generic algorithm capable of learning a large class of equilibria for NFGs. We show that $Φ$-Hedge can be directly used to learn Nash Equilibria (zero-sum settings), Normal-Form Coarse Correlated Equilibria (NFCCE), and Extensive-Form Correlated Equilibria (EFCE) in EFGs. We prove that, in those settings, the \emph{$Φ$-Hedge} algorithms are equivalent to standard Online Mirror Descent (OMD) algorithms for EFGs with suitable dilated regularizers, and run in polynomial time. This new connection further allows us to design and analyze a new class of OMD algorithms based on modifying their log-partition functions. In particular, we design an improved algorithm with balancing techniques that achieves a sharp $\widetilde{\mathcal{O}}(\sqrt{XAT})$ EFCE-regret under bandit feedback in an EFG with $X$ information sets, $A$ actions, and $T$ episodes. To the best of our knowledge, this is the first such rate, and it matches the information-theoretic lower bound.
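The $Φ$-Hedge family discussed in the abstract generalizes the classic Hedge (exponential-weights) update over a strategy set. As background only, here is a minimal sketch of the vanilla Hedge update on the probability simplex; this is the generic building block, not the paper's $Φ$-Hedge or its dilated-regularizer OMD equivalent, and the toy loss sequence is purely illustrative.

```python
import numpy as np

def hedge_update(weights, losses, eta):
    """One Hedge (exponential-weights) step: multiply each action's
    weight by exp(-eta * loss) and renormalize onto the simplex."""
    new_w = weights * np.exp(-eta * losses)
    return new_w / new_w.sum()

# Toy run: 2 actions; the adversary always penalizes action 0,
# so the distribution should concentrate on action 1.
T, eta = 100, 0.1
w = np.array([0.5, 0.5])
for _ in range(T):
    losses = np.array([1.0, 0.0])  # hypothetical loss vector
    w = hedge_update(w, losses, eta)
```

With a suitably tuned step size $\eta$, this update achieves $O(\sqrt{T \log A})$ external regret over $A$ actions; the paper's contribution is showing that the (much larger) $Φ$-Hedge instantiations for EFGs collapse to polynomial-time OMD updates.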