Paper Title
Cooperative Policy Learning with Pre-trained Heterogeneous Observation Representations
Paper Authors
Paper Abstract
Multi-agent reinforcement learning (MARL) has been increasingly explored to learn cooperative policies that maximize a certain global reward. Many existing studies take advantage of graph neural networks (GNNs) in MARL to propagate critical collaborative information over the interaction graph built upon inter-connected agents. Nevertheless, the vanilla GNN approach exhibits substantial defects in dealing with complex real-world scenarios, since the generic message-passing mechanism is ineffective between heterogeneous vertices and, moreover, simple message aggregation functions are incapable of accurately modeling the combinational interactions from multiple neighbors. While adopting complex GNN models with more informative message-passing and aggregation mechanisms can clearly benefit heterogeneous vertex representations and cooperative policy learning, it could, on the other hand, increase the training difficulty of MARL and demand more intense and direct reward signals than the original global reward. To address these challenges, we propose a new cooperative learning framework with pre-trained heterogeneous observation representations. In particular, we employ an encoder-decoder based graph attention mechanism to learn the intricate interactions and heterogeneous representations that can be more easily leveraged by MARL. Moreover, we design a pre-training procedure with a local actor-critic algorithm to ease the difficulty of cooperative policy learning. Extensive experiments on real-world scenarios demonstrate that our new approach can significantly outperform existing MARL baselines as well as operational research solutions that are widely used in industry.
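
The abstract describes the architecture only at a high level, so the following minimal PyTorch sketch illustrates the two central ideas under stated assumptions: (1) per-type projections plus attention-weighted aggregation so that heterogeneous vertex observations can exchange messages in a shared embedding space, and (2) a local actor-critic head that can be pre-trained on denser per-agent signals before the encoder is reused for global-reward MARL. All class, function, and parameter names (HeteroGraphAttentionEncoder, LocalActorCritic, type_proj, etc.) are illustrative assumptions, not the authors' implementation.

    # Minimal sketch, assuming single-head additive attention and per-type linear projections.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HeteroGraphAttentionEncoder(nn.Module):
        """Per-type projection + attention-weighted aggregation over neighbors."""

        def __init__(self, obs_dims_by_type, hidden_dim):
            super().__init__()
            # One projection per vertex type maps heterogeneous observations
            # into a common hidden space before message passing.
            self.type_proj = nn.ModuleDict(
                {t: nn.Linear(d, hidden_dim) for t, d in obs_dims_by_type.items()}
            )
            self.attn = nn.Linear(2 * hidden_dim, 1)  # scores a (target, neighbor) pair

        def forward(self, obs_by_vertex, vertex_types, neighbors):
            # obs_by_vertex: dict vertex_id -> observation tensor
            # vertex_types:  dict vertex_id -> type name
            # neighbors:     dict vertex_id -> list of neighbor vertex_ids
            h = {v: self.type_proj[vertex_types[v]](x) for v, x in obs_by_vertex.items()}
            out = {}
            for v, nbrs in neighbors.items():
                if not nbrs:
                    out[v] = h[v]
                    continue
                nbr_h = torch.stack([h[u] for u in nbrs])              # (k, hidden)
                pair = torch.cat([h[v].expand_as(nbr_h), nbr_h], -1)   # (k, 2*hidden)
                alpha = F.softmax(self.attn(pair).squeeze(-1), dim=0)  # attention weights over neighbors
                out[v] = h[v] + alpha @ nbr_h                          # weighted aggregation
            return out

    class LocalActorCritic(nn.Module):
        """Actor-critic head operating on one agent's encoded representation."""

        def __init__(self, hidden_dim, n_actions):
            super().__init__()
            self.actor = nn.Linear(hidden_dim, n_actions)
            self.critic = nn.Linear(hidden_dim, 1)

        def forward(self, z):
            return F.softmax(self.actor(z), dim=-1), self.critic(z)

In this reading, pre-training would optimize the encoder jointly with the per-agent (local) actor-critic heads against local reward signals, and the resulting encoder weights would then initialize the cooperative policy trained against the sparser global reward; the exact losses and reward definitions are not specified in the abstract.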