Paper Title
QPLEX: Duplex Dueling Multi-Agent Q-Learning
Paper Authors
Paper Abstract
We explore value-based multi-agent reinforcement learning (MARL) in the popular paradigm of centralized training with decentralized execution (CTDE). CTDE has an important concept, Individual-Global-Max (IGM) principle, which requires the consistency between joint and local action selections to support efficient local decision-making. However, in order to achieve scalability, existing MARL methods either limit representation expressiveness of their value function classes or relax the IGM consistency, which may suffer from instability risk or may not perform well in complex domains. This paper presents a novel MARL approach, called duPLEX dueling multi-agent Q-learning (QPLEX), which takes a duplex dueling network architecture to factorize the joint value function. This duplex dueling structure encodes the IGM principle into the neural network architecture and thus enables efficient value function learning. Theoretical analysis shows that QPLEX achieves a complete IGM function class. Empirical experiments on StarCraft II micromanagement tasks demonstrate that QPLEX significantly outperforms state-of-the-art baselines in both online and offline data collection settings, and also reveal that QPLEX achieves high sample efficiency and can benefit from offline datasets without additional online exploration.
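For reference, the IGM principle mentioned in the abstract has a standard formal statement in the MARL literature (the notation below is the conventional one, not text taken from this abstract): for a joint action-value function Q_tot and per-agent utilities Q_i, IGM requires that independent greedy local action selection reproduces the greedy joint action,

\arg\max_{\boldsymbol{a}} Q_{tot}(\boldsymbol{\tau}, \boldsymbol{a}) = \left( \arg\max_{a_1} Q_1(\tau_1, a_1), \; \ldots, \; \arg\max_{a_n} Q_n(\tau_n, a_n) \right),

where \boldsymbol{\tau} = (\tau_1, \ldots, \tau_n) is the joint action-observation history and \boldsymbol{a} = (a_1, \ldots, a_n) is the joint action. This consistency is what allows each agent to act greedily on its own Q_i at execution time while still maximizing the centralized Q_tot.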