Paper Title
Discovered Policy Optimisation
Paper Authors
Paper Abstract
Tremendous progress has been made in reinforcement learning (RL) over the past decade. Most of these advancements came through the continual development of new algorithms, which were designed using a combination of mathematical derivations, intuitions, and experimentation. Such an approach of creating algorithms manually is limited by human understanding and ingenuity. In contrast, meta-learning provides a toolkit for automatic machine learning method optimisation, potentially addressing this flaw. However, black-box approaches which attempt to discover RL algorithms with minimal prior structure have thus far not outperformed existing hand-crafted algorithms. Mirror Learning, a framework that includes RL algorithms such as PPO, offers a potential middle-ground starting point: while every method in this framework comes with theoretical guarantees, the components that differentiate them are subject to design. In this paper, we explore the Mirror Learning space by meta-learning a "drift" function. We refer to the immediate result as Learnt Policy Optimisation (LPO). By analysing LPO we gain original insights into policy optimisation, which we use to formulate a novel, closed-form RL algorithm, Discovered Policy Optimisation (DPO). Our experiments in Brax environments confirm the state-of-the-art performance of LPO and DPO, as well as their transfer to unseen settings.
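To make the abstract's central idea concrete, the sketch below shows (in JAX, matching the Brax setting) how a Mirror-Learning-style surrogate objective can replace PPO's clipping with a closed-form "drift" penalty on the probability ratio and advantage. This is a minimal illustration, not the authors' released code: the exact piecewise drift expression and the constants `alpha` and `beta` are assumptions made for the example.

```python
# A minimal, illustrative sketch of a drift-based policy-optimisation surrogate.
# The drift form and the constants alpha/beta are assumptions, not the paper's
# verified specification.
import jax
import jax.numpy as jnp


def ppo_clip_objective(ratio, adv, eps=0.2):
    # Standard PPO clipped surrogate, shown for comparison.
    return jnp.minimum(ratio * adv, jnp.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)


def drift_objective(ratio, adv, alpha=2.0, beta=0.6):
    # Mirror-Learning-style surrogate: the unclipped term minus a non-negative
    # drift penalty that grows as the new policy moves away from the old one.
    # The piecewise split on the sign of the advantage is an illustrative choice.
    pos = jax.nn.relu((ratio - 1.0) * adv - alpha * jnp.tanh((ratio - 1.0) * adv / alpha))
    neg = jax.nn.relu(jnp.log(ratio) * adv - beta * jnp.tanh(jnp.log(ratio) * adv / beta))
    drift = jnp.where(adv >= 0.0, pos, neg)
    return ratio * adv - drift


# Usage: average over a batch of probability ratios and advantage estimates,
# then ascend the objective (descend its negation) with any gradient optimiser.
ratios = jnp.array([0.9, 1.1, 1.3])
advs = jnp.array([0.5, -0.2, 1.0])
loss = -jnp.mean(drift_objective(ratios, advs))
```

In this framing, different drift functions yield different members of the Mirror Learning space; a meta-learned drift corresponds to LPO, while a hand-written closed-form drift in the spirit of the snippet above corresponds to DPO.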