Paper Title
Independent Policy Gradient for Large-Scale Markov Potential Games: Sharper Rates, Function Approximation, and Game-Agnostic Convergence
Paper Authors
Paper Abstract
We examine the global non-asymptotic convergence properties of policy gradient methods for multi-agent reinforcement learning (RL) problems in Markov potential games (MPGs). To learn a Nash equilibrium of an MPG in which the size of the state space and/or the number of players can be very large, we propose new independent policy gradient algorithms that are run by all players in tandem. When there is no uncertainty in the gradient evaluation, we show that our algorithm finds an $ε$-Nash equilibrium with an $O(1/ε^2)$ iteration complexity that does not explicitly depend on the state space size. When exact gradients are not available, we establish an $O(1/ε^5)$ sample complexity bound, in a potentially infinitely large state space, for a sample-based algorithm that utilizes function approximation. Moreover, we identify a class of independent policy gradient algorithms that converge for both zero-sum Markov games and Markov cooperative games even when the players are oblivious to the type of game being played. Finally, we provide computational experiments to corroborate the merits and effectiveness of our theoretical developments.
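As a concrete illustration of independent learning in this setting, the sketch below runs independent policy gradient ascent with exact gradients in a small, randomly generated two-player Markov cooperative game (shared rewards, a special case of an MPG). This is a minimal sketch, not the paper's algorithm: the softmax parameterization, exact policy evaluation in place of sampled gradients, and all hyperparameters (`gamma`, `eta`, the iteration budget) are illustrative assumptions, and the paper's projected updates, function approximation, and sample-based variants are not reproduced here. The only feature it is meant to convey is that each player updates its own policy using only its own gradient, with no coordination.

```python
# Minimal sketch (illustrative, not the paper's method): independent softmax
# policy gradient ascent in a random two-player Markov cooperative game,
# which is a special case of a Markov potential game.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 2, 2, 0.9  # tiny state/action spaces, illustrative discount

# Shared (cooperative) reward r[s, a1, a2] and transitions P[s, a1, a2, s'].
r = rng.uniform(0.0, 1.0, size=(nS, nA, nA))
P = rng.uniform(0.1, 1.0, size=(nS, nA, nA, nS))
P /= P.sum(axis=-1, keepdims=True)
rho = np.full(nS, 1.0 / nS)  # initial state distribution

def joint_quantities(pi1, pi2):
    """Exact V, per-player Q, and discounted state visitation for the joint policy."""
    pi = pi1[:, :, None] * pi2[:, None, :]                 # joint action probs per state
    r_pi = np.einsum('sab,sab->s', pi, r)                  # expected reward per state
    P_pi = np.einsum('sab,sabt->st', pi, P)                # induced state kernel
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)   # Bellman equation solve
    Q_joint = r + gamma * np.einsum('sabt,t->sab', P, V)   # one-step lookahead values
    Q1 = np.einsum('sab,sb->sa', Q_joint, pi2)             # player 1's marginal Q
    Q2 = np.einsum('sab,sa->sb', Q_joint, pi1)             # player 2's marginal Q
    d = (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, rho)  # visitation
    return V, Q1, Q2, d

def softmax(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

theta1 = np.zeros((nS, nA))
theta2 = np.zeros((nS, nA))
eta = 1.0  # illustrative step size

for _ in range(2000):
    pi1, pi2 = softmax(theta1), softmax(theta2)
    V, Q1, Q2, d = joint_quantities(pi1, pi2)
    # Softmax policy gradient: d/dtheta_i(s,a) = d(s) * pi_i(a|s) * A_i(s,a) / (1-gamma).
    A1 = Q1 - (pi1 * Q1).sum(axis=1, keepdims=True)
    A2 = Q2 - (pi2 * Q2).sum(axis=1, keepdims=True)
    theta1 += eta * d[:, None] * pi1 * A1 / (1 - gamma)    # player 1's independent update
    theta2 += eta * d[:, None] * pi2 * A2 / (1 - gamma)    # player 2's independent update

V, _, _, _ = joint_quantities(softmax(theta1), softmax(theta2))
print("value of the learned joint policy:", rho @ V)
```

Because rewards are shared, every stationary point of these independent updates at which both policies have full support corresponds to a Nash equilibrium of the cooperative game; in the noiseless setting studied in the paper, the analogous (projected) updates reach an $ε$-Nash equilibrium in $O(1/ε^2)$ iterations.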