Paper Title
Convergence and Price of Anarchy Guarantees of the Softmax Policy Gradient in Markov Potential Games
Paper Authors
Paper Abstract
We study the performance of policy gradient methods for the subclass of Markov games known as Markov potential games (MPGs), which extends the notion of normal-form potential games to the stateful setting and includes the important special case of fully cooperative settings in which the agents share an identical reward function. The focus of this paper is the convergence of policy gradient methods for solving MPGs under softmax policy parameterization, both tabular and parameterized with general function approximators such as neural networks. We first show the asymptotic convergence of this method to a Nash equilibrium of MPGs for tabular softmax policies. Second, we derive finite-time performance guarantees for the policy gradient in two settings: 1) with log-barrier regularization, and 2) with the natural policy gradient under best-response dynamics (NPG-BR). Finally, extending the notions of price of anarchy (POA) and smoothness from normal-form games, we introduce the POA for MPGs and provide a POA bound for NPG-BR. To our knowledge, this is the first POA bound for solving MPGs. To support our theoretical results, we empirically compare the convergence rates and POA of policy gradient variants for both tabular and neural softmax policies.
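
As a concrete illustration of the tabular softmax setting the abstract refers to, the sketch below runs independent softmax policy gradient ascent on a small, randomly generated, fully cooperative Markov game (the identical-reward special case of an MPG). This is a minimal sketch, not the paper's algorithm or code: the random environment, the exact-gradient computation via the standard policy gradient theorem, and all names and hyperparameters (S, A, gamma, lr, iters) are assumptions made only for this example.

```python
# Minimal sketch (not the paper's implementation): independent policy gradient
# ascent with tabular softmax policies on a randomly generated two-agent,
# fully cooperative Markov game (shared reward). All sizes and hyperparameters
# below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, lr, iters = 3, 2, 0.9, 1.0, 2000

# Shared reward r[s, a1, a2] and transitions P[s, a1, a2, s'] (rows sum to 1).
r = rng.random((S, A, A))
P = rng.random((S, A, A, S))
P /= P.sum(axis=-1, keepdims=True)
rho = np.full(S, 1.0 / S)                      # initial state distribution

theta = [np.zeros((S, A)), np.zeros((S, A))]   # softmax logits per agent

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

for t in range(iters):
    pi1, pi2 = softmax(theta[0]), softmax(theta[1])

    # Joint-policy state transition matrix and expected reward per state.
    joint = pi1[:, :, None] * pi2[:, None, :]          # [s, a1, a2]
    P_pi = np.einsum('sij,sijt->st', joint, P)
    r_pi = np.einsum('sij,sij->s', joint, r)

    # Exact value function, joint Q-values, and discounted state occupancy.
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = r + gamma * np.einsum('sijt,t->sij', P, V)
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)

    # Per-agent marginalized Q-values and softmax policy gradients
    # (policy gradient theorem for tabular softmax parameterization).
    Q1 = np.einsum('sij,sj->si', Q, pi2)               # agent 1's Q given pi2
    Q2 = np.einsum('sij,si->sj', Q, pi1)               # agent 2's Q given pi1
    g1 = d[:, None] * pi1 * (Q1 - V[:, None]) / (1 - gamma)
    g2 = d[:, None] * pi2 * (Q2 - V[:, None]) / (1 - gamma)

    # Simultaneous (independent) gradient ascent on the shared value.
    theta[0] += lr * g1
    theta[1] += lr * g2

print("value of the last evaluated joint policy:", rho @ V)
```

In the fully cooperative case shown here, both agents ascend the same shared value, which plays the role of the potential function; in a general MPG each agent would ascend its own value while the potential function governs convergence.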