Paper Title
Human-level Atari 200x faster
Paper Authors
Paper Abstract
The task of building general agents that perform well over a wide range of tasks has been an important goal in reinforcement learning since its inception. The problem has been the subject of a large body of research, with performance frequently measured by observing scores over the wide range of environments contained in the Atari 57 benchmark. Agent57 was the first agent to surpass the human benchmark on all 57 games, but this came at the cost of poor data-efficiency, requiring nearly 80 billion frames of experience. Taking Agent57 as a starting point, we employ a diverse set of strategies to achieve a 200-fold reduction in the experience needed to outperform the human baseline. We investigate a range of instabilities and bottlenecks we encountered while reducing the data regime, and propose effective solutions to build a more robust and efficient agent. We also demonstrate competitive performance with high-performing methods such as Muesli and MuZero. The four key components of our approach are (1) an approximate trust region method which enables stable bootstrapping from the online network, (2) a normalisation scheme for the loss and priorities which improves robustness when learning a set of value functions with a wide range of scales, (3) an improved architecture employing techniques from NFNets in order to leverage deeper networks without the need for normalization layers, and (4) a policy distillation method which serves to smooth out the instantaneous greedy policy over time.
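As an illustration of component (2), the sketch below shows one simple way a loss-and-priority normalisation across value functions of very different scales could look: TD errors are divided by a running per-head root-mean-square estimate before forming the loss and the replay priorities. This is a minimal sketch under assumed details; the `TDScaleNormaliser` class, the running-RMS statistic, and the decay constant are illustrative and are not the paper's exact formulation.

```python
import numpy as np

class TDScaleNormaliser:
    """Running root-mean-square of TD errors, kept per value head.

    Illustrative only: the abstract describes normalising the loss and
    priorities when learning value functions with a wide range of scales;
    the exact statistic and decay used here are assumptions.
    """

    def __init__(self, num_heads, decay=0.99, eps=1e-6):
        self.ms = np.ones(num_heads)   # running mean-square per head
        self.decay = decay
        self.eps = eps

    def update(self, td_errors):
        # td_errors: array of shape [batch, num_heads]
        batch_ms = np.mean(td_errors ** 2, axis=0)
        self.ms = self.decay * self.ms + (1.0 - self.decay) * batch_ms

    def normalise(self, td_errors):
        scale = np.sqrt(self.ms) + self.eps
        return td_errors / scale


# Toy usage: two value heads whose TD errors differ in scale by ~1000x.
rng = np.random.default_rng(0)
td = np.stack([rng.normal(0, 1.0, 32), rng.normal(0, 1000.0, 32)], axis=1)

norm = TDScaleNormaliser(num_heads=2)
norm.update(td)
td_n = norm.normalise(td)

loss = np.mean(td_n ** 2)                  # neither head dominates the loss
priorities = np.max(np.abs(td_n), axis=1)  # nor the replay priorities
print(loss, priorities[:4])
```

With this kind of rescaling, a value head whose returns are orders of magnitude larger no longer dominates either the gradient or the choice of which transitions are replayed.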