Paper Title


Residual Q-Networks for Value Function Factorizing in Multi-Agent Reinforcement Learning

Paper Authors

Pina, Rafael, De Silva, Varuna, Hook, Joosep, Kondoz, Ahmet

Paper Abstract


Multi-Agent Reinforcement Learning (MARL) is useful in many problems that require the cooperation and coordination of multiple agents. Learning optimal policies using reinforcement learning in a multi-agent setting can be very difficult as the number of agents increases. Recent solutions such as Value Decomposition Networks (VDN), QMIX, QTRAN and QPLEX adhere to the centralized training and decentralized execution scheme and perform factorization of the joint action-value functions. However, these methods still suffer from increased environmental complexity, and at times fail to converge in a stable manner. We propose a novel concept of Residual Q-Networks (RQNs) for MARL, which learns to transform the individual Q-value trajectories in a way that preserves the Individual-Global-Max (IGM) criterion, but is more robust in factorizing action-value functions. The RQN acts as an auxiliary network that accelerates convergence and will become obsolete as the agents reach the training objectives. The performance of the proposed method is compared against several state-of-the-art techniques such as QPLEX, QMIX, QTRAN and VDN, in a range of multi-agent cooperative tasks. The results illustrate that the proposed method, in general, converges faster, with increased stability and shows robust performance in a wider family of environments. The improvements in results are more prominent in environments with severe punishments for non-cooperative behaviours and especially in the absence of complete state information during training time.
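The IGM criterion referenced in the abstract requires that the greedy joint action of the factorized joint action-value function coincide with each agent's individual greedy action. A minimal sketch (an assumed illustrative example, not code from the paper) showing that a VDN-style additive factorization satisfies IGM for two agents with hypothetical per-agent utilities:

```python
import numpy as np

# Hypothetical per-agent utilities over 3 discrete actions each
# (values are made up for illustration).
q_agent1 = np.array([0.2, 1.5, -0.3])
q_agent2 = np.array([-0.1, 0.4, 2.0])

# VDN-style additive mixing: Q_tot(a1, a2) = Q_1(a1) + Q_2(a2).
q_tot = q_agent1[:, None] + q_agent2[None, :]

# Greedy joint action recovered from the joint action-value table.
joint_greedy = np.unravel_index(q_tot.argmax(), q_tot.shape)

# Each agent's individual greedy action.
individual_greedy = (int(q_agent1.argmax()), int(q_agent2.argmax()))

# IGM holds: the joint argmax equals the tuple of individual argmaxes.
assert joint_greedy == individual_greedy
print(joint_greedy)  # (1, 2)
```

Monotonic mixing functions (as in QMIX) preserve this property more generally; the paper's RQN is described as an auxiliary network that keeps IGM intact while making the factorization itself more robust.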
