Paper Title
Regret Minimization and Convergence to Equilibria in General-sum Markov Games
Paper Authors
Paper Abstract
An abundance of recent impossibility results establish that regret minimization in Markov games with adversarial opponents is both statistically and computationally intractable. Nevertheless, none of these results preclude the possibility of regret minimization under the assumption that all parties adopt the same learning procedure. In this work, we present the first (to our knowledge) algorithm for learning in general-sum Markov games that provides sublinear regret guarantees when executed by all agents. The bounds we obtain are for swap regret, and thus, along the way, imply convergence to a correlated equilibrium. Our algorithm is decentralized, computationally efficient, and does not require any communication between agents. Our key observation is that online learning via policy optimization in Markov games essentially reduces to a form of weighted regret minimization, with unknown weights determined by the path length of the agents' policy sequence. Consequently, controlling the path length leads to weighted regret objectives for which sufficiently adaptive algorithms provide sublinear regret guarantees.
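To make the abstract's central notion concrete, below is a minimal sketch, under simplifying assumptions, of the kind of weighted regret minimization the reduction produces: a Hedge (multiplicative-weights) learner run on importance-weighted losses in a full-information setting. The function name `weighted_hedge` and the fixed step-size tuning are our illustration, not the paper's decentralized algorithm; in particular, the paper stresses that the weights are unknown in advance, which is precisely why a sufficiently adaptive step size is needed, whereas this sketch assumes the total weight is available for tuning.

```python
import numpy as np

def weighted_hedge(losses, weights):
    """Hedge on importance-weighted losses (illustrative sketch only).

    losses:  (T, n) array of loss vectors with entries in [0, 1]
    weights: (T,) array of nonnegative per-round weights w_t
    Returns the (T, n) array of action distributions p_1, ..., p_T.
    """
    T, n = losses.shape
    # Fixed tuning that assumes the total weight is known up front; the
    # unknown-weights setting described in the abstract instead calls for
    # an adaptive step size (e.g., tuned to the observed cumulative weight).
    eta = np.sqrt(np.log(n) / max(1.0, weights.sum()))
    cum = np.zeros(n)  # cumulative weighted loss of each action
    plays = []
    for t in range(T):
        p = np.exp(-eta * (cum - cum.min()))  # shift for numerical stability
        p /= p.sum()
        plays.append(p)
        cum += weights[t] * losses[t]  # weighted update
    return np.array(plays)

# Weighted external regret of the resulting plays:
#   sum_t w_t * <p_t, ell_t>  -  min_a sum_t w_t * ell_t[a]
rng = np.random.default_rng(0)
L, w = rng.random((1000, 5)), rng.random(1000)
P = weighted_hedge(L, w)
regret = (w * np.einsum("tn,tn->t", P, L)).sum() - (w[:, None] * L).sum(axis=0).min()
```

The sketch bounds weighted external regret against the best fixed action; swap regret, which the paper bounds, strengthens this benchmark by competing with the best action-remapping applied to the learner's own plays, and bounding it simultaneously for all agents is what yields convergence to a correlated equilibrium.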