Paper Title
Reward Machines for Cooperative Multi-Agent Reinforcement Learning
Paper Authors
Paper Abstract
In cooperative multi-agent reinforcement learning, a collection of agents learns to interact in a shared environment to achieve a common goal. We propose the use of reward machines (RMs) -- Mealy machines used as structured representations of reward functions -- to encode the team's task. The proposed novel interpretation of RMs in the multi-agent setting explicitly encodes required teammate interdependencies, allowing the team-level task to be decomposed into sub-tasks for individual agents. We define such a notion of RM decomposition and present algorithmically verifiable conditions guaranteeing that distributed completion of the sub-tasks leads to team behavior accomplishing the original task. This framework for task decomposition provides a natural approach to decentralized learning: agents may learn to accomplish their sub-tasks while observing only their local state and abstracted representations of their teammates. We accordingly propose a decentralized Q-learning algorithm. Furthermore, in the case of undiscounted rewards, we use local value functions to derive lower and upper bounds on the global value function corresponding to the team task. Experimental results in three discrete settings exemplify the effectiveness of the proposed RM decomposition approach, which converges to a successful team policy an order of magnitude faster than a centralized learner and significantly outperforms hierarchical and independent Q-learning approaches.
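The abstract describes a reward machine as a Mealy machine whose transitions are triggered by high-level events and whose outputs are rewards. The following is a minimal, hypothetical Python sketch of that idea; the class name `RewardMachine` and its fields (`delta_u`, `delta_r`) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a reward machine: a Mealy machine whose transitions are
# triggered by high-level events (labels) and whose outputs are scalar rewards.
# All names here are illustrative assumptions, not code from the paper.

class RewardMachine:
    def __init__(self, initial_state, delta_u, delta_r, final_states):
        self.u0 = initial_state            # initial RM state
        self.u = initial_state             # current RM state
        self.delta_u = delta_u             # dict: (u, event) -> next RM state
        self.delta_r = delta_r             # dict: (u, event) -> reward emitted
        self.final_states = final_states   # RM states marking task completion

    def reset(self):
        self.u = self.u0

    def step(self, event):
        """Advance the machine on an observed event; return the emitted reward."""
        reward = self.delta_r.get((self.u, event), 0.0)
        self.u = self.delta_u.get((self.u, event), self.u)
        return reward

    def is_done(self):
        return self.u in self.final_states


# Example: a two-step task "press the button (event 'b'), then reach the
# goal (event 'g')", rewarded only on completion.
rm = RewardMachine(
    initial_state=0,
    delta_u={(0, "b"): 1, (1, "g"): 2},
    delta_r={(1, "g"): 1.0},
    final_states={2},
)
```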
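The abstract also states that each agent learns its sub-task while observing only its local state and an abstracted representation of its teammates, captured here by the agent's projected sub-task RM. Below is a hedged sketch of such a decentralized Q-learning loop under those assumptions; `env_i` (with `reset()` returning a local state and `step(a)` returning a local state plus an observed event) is an assumed interface, and the hyperparameters are placeholders rather than values from the paper.

```python
import random
from collections import defaultdict

# Hedged sketch of the decentralized learning idea from the abstract: agent i
# maintains a Q-function over its *local* environment state s and the state u
# of its projected sub-task RM, ignoring teammates' internal states.
# `env_i` is an assumed interface (reset() -> s; step(a) -> (s_next, event)).

def decentralized_q_learning(env_i, rm_i, actions, episodes=1000, max_steps=200,
                             alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                      # Q[(s, u, a)]
    for _ in range(episodes):
        s = env_i.reset()
        rm_i.reset()
        for _ in range(max_steps):
            u = rm_i.u
            # epsilon-greedy over the agent's own action set only
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, u, act)])
            s_next, event = env_i.step(a)       # event: high-level label observed locally
            r = rm_i.step(event)                # reward emitted by the local sub-task RM
            # standard Q-learning backup on the product space (s, u)
            best_next = max(Q[(s_next, rm_i.u, act)] for act in actions)
            Q[(s, u, a)] += alpha * (r + gamma * best_next - Q[(s, u, a)])
            s = s_next
            if rm_i.is_done():
                break
    return Q
```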