Paper title
Learning to Advise and Learning from Advice in Cooperative Multi-Agent Reinforcement Learning
Paper authors
Paper abstract
Learning to coordinate is a daunting problem in multi-agent reinforcement learning (MARL). Previous work has explored it from many facets, including cognition between agents, credit assignment, communication, and expert demonstration. However, less attention has been paid to the structure of agents' decisions and the hierarchy of coordination. In this paper, we explore the spatiotemporal structure of agents' decisions and consider the hierarchy of coordination from the perspective of multilevel emergence dynamics, based on which a novel approach, Learning to Advise and Learning from Advice (LALA), is proposed to improve MARL. Specifically, by distinguishing the levels of the coordination hierarchy, we propose to enhance decision coordination at the meso level with an advisor and to leverage a policy discriminator to advise agents' learning at the micro level. The advisor learns to aggregate decision information in both the spatial and temporal domains and generates coordinated decisions by employing a spatiotemporal dual graph convolutional neural network with a task-oriented objective function. Each agent learns from the advice via a policy generative adversarial learning method, in which a discriminator distinguishes between the policies of the agent and the advisor and improves both of them based on its judgement. Experimental results indicate the advantage of LALA over baseline approaches in terms of both learning efficiency and coordination capability. The coordination mechanism is investigated from the perspectives of multilevel emergence dynamics and mutual information, which provides a novel perspective and method for analyzing and improving MARL algorithms.
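The abstract mentions mutual information as one lens for studying coordination but does not say how it is estimated. As a purely illustrative sketch, and not the authors' procedure, the following computes the standard empirical mutual information between the discrete action sequences of two agents. The function name `mutual_information` and the toy action sequences are assumptions introduced here.

```python
import math
from collections import Counter


def mutual_information(actions_a, actions_b):
    """Empirical mutual information (in nats) between two discrete action sequences.

    Hypothetical helper for illustration only: it uses the plug-in estimator
    I(A;B) = sum_{a,b} p(a,b) * log(p(a,b) / (p(a) * p(b))),
    with probabilities taken as observed frequencies.
    """
    if len(actions_a) != len(actions_b):
        raise ValueError("action sequences must have the same length")
    n = len(actions_a)
    if n == 0:
        raise ValueError("action sequences must be non-empty")

    joint = Counter(zip(actions_a, actions_b))  # counts of (a, b) pairs
    marg_a = Counter(actions_a)                 # counts of agent A's actions
    marg_b = Counter(actions_b)                 # counts of agent B's actions

    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p(a,b) / (p(a) p(b)) simplifies to count * n / (count_a * count_b)
        mi += p_ab * math.log(count * n / (marg_a[a] * marg_b[b]))
    return mi


if __name__ == "__main__":
    # Toy data (assumed): two agents that always pick the same action.
    print(mutual_information([0, 1, 0, 1], [0, 1, 0, 1]))
    # Toy data (assumed): two agents whose actions are statistically independent.
    print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

With these toy sequences, the fully matched pair yields log 2 (about 0.693 nats) and the independent pair yields 0, so a higher value indicates stronger statistical dependence between the two agents' decisions. The plug-in estimator is biased for short sequences, so values from small samples should be read with caution.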