Paper Title
Discovered Policy Optimisation
Paper Authors
Paper Abstract
Tremendous progress has been made in reinforcement learning (RL) over the past decade. Most of these advancements came through the continual development of new algorithms, which were designed using a combination of mathematical derivations, intuitions, and experimentation. Such an approach of creating algorithms manually is limited by human understanding and ingenuity. In contrast, meta-learning provides a toolkit for automatic machine learning method optimisation, potentially addressing this flaw. However, black-box approaches which attempt to discover RL algorithms with minimal prior structure have thus far not outperformed existing hand-crafted algorithms. Mirror Learning, a framework that includes RL algorithms such as PPO, offers a potential middle-ground starting point: while every method in this framework comes with theoretical guarantees, the components that differentiate them are subject to design. In this paper, we explore the Mirror Learning space by meta-learning a "drift" function. We refer to the immediate result as Learnt Policy Optimisation (LPO). By analysing LPO we gain original insights into policy optimisation, which we use to formulate a novel, closed-form RL algorithm, Discovered Policy Optimisation (DPO). Our experiments in Brax environments confirm the state-of-the-art performance of LPO and DPO, as well as their transfer to unseen settings.
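To make the abstract's central idea concrete, the sketch below shows (in JAX, matching the Brax setting) how a Mirror-Learning-style surrogate objective can replace PPO's clipping with a closed-form "drift" penalty on the probability ratio and advantage. This is a minimal illustration, not the authors' released code: the exact piecewise drift expression and the constants `alpha` and `beta` are assumptions made for the example.

```python
# A minimal, illustrative sketch of a drift-based policy-optimisation surrogate.
# The drift form and the constants alpha/beta are assumptions, not the paper's
# verified specification.
import jax
import jax.numpy as jnp


def ppo_clip_objective(ratio, adv, eps=0.2):
    # Standard PPO clipped surrogate, shown for comparison.
    return jnp.minimum(ratio * adv, jnp.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)


def drift_objective(ratio, adv, alpha=2.0, beta=0.6):
    # Mirror-Learning-style surrogate: the unclipped term minus a non-negative
    # drift penalty that grows as the new policy moves away from the old one.
    # The piecewise split on the sign of the advantage is an illustrative choice.
    pos = jax.nn.relu((ratio - 1.0) * adv - alpha * jnp.tanh((ratio - 1.0) * adv / alpha))
    neg = jax.nn.relu(jnp.log(ratio) * adv - beta * jnp.tanh(jnp.log(ratio) * adv / beta))
    drift = jnp.where(adv >= 0.0, pos, neg)
    return ratio * adv - drift


# Usage: average over a batch of probability ratios and advantage estimates,
# then ascend the objective (descend its negation) with any gradient optimiser.
ratios = jnp.array([0.9, 1.1, 1.3])
advs = jnp.array([0.5, -0.2, 1.0])
loss = -jnp.mean(drift_objective(ratios, advs))
```

In this framing, different drift functions yield different members of the Mirror Learning space; a meta-learned drift corresponds to LPO, while a hand-written closed-form drift in the spirit of the snippet above corresponds to DPO.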