Paper Title

Decentralized Policy Optimization

Paper Authors

Kefan Su, Zongqing Lu

Paper Abstract

The study of decentralized learning, or independent learning, in cooperative multi-agent reinforcement learning has a history of decades. Recent empirical studies show that independent PPO (IPPO) can obtain performance close to, or even better than, that of methods based on centralized training with decentralized execution in several benchmarks. However, a decentralized actor-critic algorithm with a convergence guarantee remains an open problem. In this paper, we propose \textit{decentralized policy optimization} (DPO), a decentralized actor-critic algorithm with monotonic improvement and convergence guarantees. We derive a novel decentralized surrogate for policy optimization such that monotonic improvement of the joint policy is guaranteed when each agent \textit{independently} optimizes the surrogate. In practice, this decentralized surrogate can be realized through two adaptive coefficients for policy optimization at each agent. Empirically, we compare DPO with IPPO on a variety of cooperative multi-agent tasks, covering discrete and continuous action spaces as well as fully and partially observable environments. The results show that DPO outperforms IPPO in most tasks, which can be seen as evidence for our theoretical results.
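
The abstract does not give the exact form of the decentralized surrogate, so the following is only an illustrative sketch: assuming a trust-region-style objective, agent $i$'s independent update might take a shape such as

\[
\pi_i^{\text{new}} \in \arg\max_{\pi_i}\;
\mathbb{E}_{s \sim d^{\pi^{\text{old}}},\, a \sim \pi^{\text{old}}}\!\left[
\frac{\pi_i(a_i \mid s)}{\pi_i^{\text{old}}(a_i \mid s)}\, A^{\pi^{\text{old}}}(s, a)
\right]
- \beta_i\, \mathbb{E}_{s}\!\left[ D_{\mathrm{KL}}\!\big(\pi_i^{\text{old}}(\cdot \mid s)\,\big\|\,\pi_i(\cdot \mid s)\big) \right]
- \nu_i\, \mathbb{E}_{s}\!\left[ D_{\mathrm{TV}}\!\big(\pi_i^{\text{old}}(\cdot \mid s),\, \pi_i(\cdot \mid s)\big) \right],
\]

where $d^{\pi^{\text{old}}}$ is the state distribution under the old joint policy, $A^{\pi^{\text{old}}}$ is the joint advantage, and the hypothetical penalty weights $\beta_i$ and $\nu_i$ stand in for the two adaptive per-agent coefficients mentioned above; the actual surrogate and the rules for adapting the coefficients are derived in the paper.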
