Paper Title
Autotelic Reinforcement Learning in Multi-Agent Environments
Paper Authors
Paper Abstract
In the intrinsically motivated skills acquisition problem, the agent is set in an environment without any pre-defined goals and needs to acquire an open-ended repertoire of skills. To do so, the agent needs to be autotelic (from the Greek auto (self) and telos (end goal)): it needs to generate goals and learn to achieve them following its own intrinsic motivation rather than external supervision. Autotelic agents have so far been considered in isolation, but many applications of open-ended learning entail groups of agents. Multi-agent environments pose an additional challenge for autotelic agents: to discover and master goals that require cooperation, agents must pursue them simultaneously, but they have little chance of doing so if they sample goals independently. In this work, we propose a new learning paradigm for modeling such settings, the Decentralized Intrinsically Motivated Skills Acquisition Problem (Dec-IMSAP), and employ it to solve cooperative navigation tasks. First, we show that agents setting their goals independently fail to master the full diversity of goals. Then, we show that a sufficient condition for mastering this diversity is to ensure that the group aligns its goals, i.e., that the agents pursue the same cooperative goal. Our empirical analysis shows that alignment enables specialization, an efficient strategy for cooperation. Finally, we introduce the Goal-coordination game, a fully decentralized emergent-communication algorithm in which goal alignment emerges from the maximization of individual rewards in multi-goal cooperative environments, and show that it matches the performance of a centralized training baseline that guarantees aligned goals. To our knowledge, this is the first contribution addressing the problem of intrinsically motivated multi-agent goal exploration in a decentralized training paradigm.
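To make the alignment argument concrete, below is a minimal illustrative sketch (not the paper's algorithm or environment): when each agent samples its goal independently from N cooperative goals, the agents pursue the same goal in roughly 1/N of episodes, whereas an aligned group pursues a shared cooperative goal every episode. The goal count, episode count, and sampling routines are hypothetical placeholders.

```python
# Illustrative sketch only: independent vs. aligned goal sampling in a
# two-agent, multi-goal cooperative setting. Hypothetical quantities.
import random

N_GOALS = 10          # hypothetical number of cooperative goals
N_EPISODES = 10_000   # hypothetical number of training episodes


def independent_sampling() -> float:
    """Each agent samples its own goal; a cooperative goal is only
    practiced when both agents happen to pick the same one."""
    joint = 0
    for _ in range(N_EPISODES):
        g1 = random.randrange(N_GOALS)
        g2 = random.randrange(N_GOALS)
        joint += (g1 == g2)
    return joint / N_EPISODES


def aligned_sampling() -> float:
    """The group agrees on a single goal per episode (e.g., via some
    coordination mechanism), so every episode practices a cooperative goal."""
    joint = 0
    for _ in range(N_EPISODES):
        g = random.randrange(N_GOALS)   # one shared goal for the whole group
        g1, g2 = g, g                   # both agents pursue the same goal
        joint += (g1 == g2)
    return joint / N_EPISODES


if __name__ == "__main__":
    print(f"independent sampling: {independent_sampling():.1%} joint episodes")
    print(f"aligned sampling:     {aligned_sampling():.1%} joint episodes")
```

With N_GOALS = 10, the independent strategy practices each cooperative goal jointly only about 10% of the time, which is the coordination failure the Goal-coordination game is designed to resolve in a decentralized way.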