Title

Mirror Learning: A Unifying Framework of Policy Optimisation

Authors

Jakub Grudzien Kuba, Christian Schroeder de Witt, Jakob Foerster

Abstract

Modern deep reinforcement learning (RL) algorithms are motivated by either the generalised policy iteration (GPI) or trust-region learning (TRL) frameworks. However, algorithms that strictly respect these theoretical frameworks have proven unscalable. Surprisingly, the only known scalable algorithms violate the GPI/TRL assumptions, e.g. due to required regularisation or other heuristics. The current explanation of their empirical success is essentially "by analogy": they are deemed approximate adaptations of theoretically sound methods. Unfortunately, studies have shown that in practice these algorithms differ greatly from their conceptual ancestors. In contrast, in this paper we introduce a novel theoretical framework, named Mirror Learning, which provides theoretical guarantees to a large class of algorithms, including TRPO and PPO. While the latter two exploit the flexibility of our framework, GPI and TRL fit in merely as pathologically restrictive corner cases thereof. This suggests that the empirical performance of state-of-the-art methods is a direct consequence of their theoretical properties, rather than of aforementioned approximate analogies. Mirror learning sets us free to boldly explore novel, theoretically sound RL algorithms, a thus far uncharted wonderland.
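
Note: the abstract does not state the update rule itself. As a rough, paraphrased sketch (not a verbatim statement from the paper), a mirror-learning update can be written in terms of a neighbourhood operator $\mathcal{N}$, a drift functional $\mathfrak{D}$, and sampling distributions $\beta$ and $\nu$:

% Paraphrased sketch; see the paper for the exact definitions, conditions, and guarantees.
\[
  \pi_{n+1} \in \operatorname*{arg\,max}_{\pi \in \mathcal{N}(\pi_n)}
    \mathbb{E}_{s \sim \beta_{\pi_n}}\!\left[ \mathbb{E}_{a \sim \pi}\!\left[ A_{\pi_n}(s,a) \right] \right]
    - \mathbb{E}_{s \sim \nu_{\pi_n}^{\pi}}\!\left[ \mathfrak{D}_{\pi_n}(\pi \mid s) \right],
\]

where the drift $\mathfrak{D}_{\pi_n}(\pi \mid s)$ is non-negative and vanishes at $\pi = \pi_n$. Under this reading, a zero drift with the neighbourhood taken as the whole policy space corresponds to the GPI corner case, a hard trust-region neighbourhood to TRL, and PPO-style clipping to a particular drift choice; the precise conditions and the resulting guarantees are developed in the paper itself.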
