Paper Title

Multi-agent Policy Optimization with Approximatively Synchronous Advantage Estimation

Authors

Lipeng Wan, Xuwei Song, Xuguang Lan, Nanning Zheng

Abstract

Cooperative multi-agent tasks require agents to deduce their own contributions from a shared global reward, a challenge known as credit assignment. General policy-based multi-agent reinforcement learning methods address this challenge by introducing differentiated value functions or advantage functions for individual agents. In a multi-agent system, the policies of different agents need to be evaluated jointly, so in order to update policies synchronously, such value functions or advantage functions must also be evaluated synchronously. However, current methods rely on counterfactual joint actions that are evaluated asynchronously, and therefore suffer from a natural estimation bias. In this work, we propose approximatively synchronous advantage estimation. We first derive the marginal advantage function, an extension of the single-agent advantage function to multi-agent systems. Furthermore, we introduce a policy approximation for synchronous advantage estimation, and decompose the multi-agent policy optimization problem into multiple sub-problems of single-agent policy optimization. Our method is compared with baseline algorithms on the StarCraft Multi-Agent Challenge, and achieves the best performance on most of the tasks.
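
For intuition only, here is a minimal sketch of what a "marginal advantage" for a single agent could look like, assuming it is obtained by averaging the joint advantage over the other agents' actions under their current policies; the symbols A, u^{-i}, and \pi^{-i} below are illustrative notation and may differ from the paper's own definitions.

A^{i}(s, u^{i}) \;=\; \mathbb{E}_{u^{-i} \sim \pi^{-i}(\cdot \mid s)} \big[ A\big(s, (u^{i}, u^{-i})\big) \big]

Here A is the joint advantage function, u^{i} is agent i's action, u^{-i} is the joint action of all other agents, and \pi^{-i} is their joint policy; under this reading, each agent's policy can be improved against its own marginal advantage, which is how the multi-agent optimization problem could decompose into single-agent sub-problems.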
