Paper Title
Off-Policy Multi-Agent Decomposed Policy Gradients
Authors
Abstract
Multi-agent policy gradient (MAPG) methods have recently witnessed vigorous progress. However, there is a significant performance discrepancy between MAPG methods and state-of-the-art multi-agent value-based approaches. In this paper, we investigate the causes that hinder the performance of MAPG algorithms and present a multi-agent decomposed policy gradient method (DOP). This method introduces the idea of value function decomposition into the multi-agent actor-critic framework. Based on this idea, DOP supports efficient off-policy learning and addresses the issues of centralized-decentralized mismatch and credit assignment in both discrete and continuous action spaces. We formally show that DOP critics have sufficient representational capability to guarantee convergence. In addition, empirical evaluations on the StarCraft II micromanagement benchmark and multi-agent particle environments demonstrate that DOP significantly outperforms both state-of-the-art value-based and policy-based multi-agent reinforcement learning algorithms. Demonstrative videos are available at https://sites.google.com/view/dop-mapg/.
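To make the "value function decomposition in an actor-critic framework" idea more concrete, below is a minimal sketch of a linearly decomposed centralized critic in PyTorch: the global value is expressed as a state-weighted sum of per-agent utilities plus a state-dependent bias. The class name, network sizes, and exact architecture are illustrative assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DecomposedCritic(nn.Module):
    """Illustrative linearly decomposed centralized critic (assumed form):
    Q_tot(s, a) = sum_i k_i(s) * Q_i(o_i, a_i) + b(s), with k_i(s) >= 0
    so that each agent's local utility contributes monotonically to the
    global value. A sketch of the value-decomposition idea, not the
    paper's exact architecture."""

    def __init__(self, n_agents, obs_dim, n_actions, state_dim, hidden=64):
        super().__init__()
        self.n_agents = n_agents
        # One utility network per agent: Q_i(o_i, .) over that agent's actions.
        self.utils = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, n_actions))
            for _ in range(n_agents)
        )
        # State-conditioned non-negative mixing weights k_i(s) and bias b(s).
        self.k = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, n_agents))
        self.b = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))

    def forward(self, state, obs, actions):
        # state: (batch, state_dim); obs: (batch, n_agents, obs_dim)
        # actions: (batch, n_agents) integer action indices
        q_i = torch.stack(
            [net(obs[:, i]) for i, net in enumerate(self.utils)], dim=1
        )  # (batch, n_agents, n_actions)
        chosen = q_i.gather(2, actions.unsqueeze(-1)).squeeze(-1)  # (batch, n_agents)
        weights = torch.abs(self.k(state))  # enforce k_i(s) >= 0
        q_tot = (weights * chosen).sum(dim=1, keepdim=True) + self.b(state)
        return q_tot, chosen

# Example usage with hypothetical dimensions:
critic = DecomposedCritic(n_agents=3, obs_dim=16, n_actions=5, state_dim=32)
state = torch.randn(8, 32)
obs = torch.randn(8, 3, 16)
actions = torch.randint(0, 5, (8, 3))
q_tot, q_locals = critic(state, obs, actions)  # q_tot: (8, 1), q_locals: (8, 3)
```

Because the global value factors linearly over per-agent utilities, each agent's policy gradient can be computed from its own utility term, which is what makes localized credit assignment and off-policy training with decentralized policies tractable in this style of method.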