Paper Title
Learning Abstract Models for Strategic Exploration and Fast Reward Transfer
Paper Authors
Paper Abstract
Model-based reinforcement learning (RL) is appealing because (i) it enables planning and thus more strategic exploration, and (ii) by decoupling dynamics from rewards, it enables fast transfer to new reward functions. However, learning an accurate Markov Decision Process (MDP) over high-dimensional states (e.g., raw pixels) is extremely challenging because it requires function approximation, which leads to compounding errors. Instead, to avoid compounding errors, we propose learning an abstract MDP over abstract states: low-dimensional coarse representations of the state (e.g., capturing agent position, ignoring other objects). We assume access to an abstraction function that maps the concrete states to abstract states. In our approach, we construct an abstract MDP, which grows through strategic exploration via planning. Similar to hierarchical RL approaches, the abstract actions of the abstract MDP are backed by learned subpolicies that navigate between abstract states. Our approach achieves strong results on three of the hardest Arcade Learning Environment games (Montezuma's Revenge, Pitfall!, and Private Eye), including superhuman performance on Pitfall! without demonstrations. After training on one task, we can reuse the learned abstract MDP for new reward functions, achieving higher reward in 1000x fewer samples than model-free methods trained from scratch.
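The abstract describes the core mechanism at a high level: an abstraction function maps concrete states to coarse abstract states, an abstract MDP is grown from observed abstract transitions, and planning over that MDP drives strategic exploration. The sketch below is a minimal illustration of that idea, not the authors' implementation: the abstraction function phi, the grid cell size, the "go_right" abstract action, the 1/sqrt(n) count-based exploration bonus, and tabular value iteration are all assumptions introduced here, and the learned subpolicies that execute abstract actions in the paper are omitted.

```python
from collections import defaultdict

def phi(concrete_state, cell_size=20):
    """Abstraction function (assumed): discretize a concrete state that exposes
    the agent's (x, y) position and room id into a coarse abstract state."""
    x, y, room = concrete_state
    return (x // cell_size, y // cell_size, room)

class AbstractMDP:
    """Tabular abstract MDP grown from observed abstract transitions."""

    def __init__(self, gamma=0.99, bonus=1.0):
        self.gamma = gamma                 # discount factor
        self.bonus = bonus                 # weight of the count-based exploration bonus
        self.counts = defaultdict(int)     # visit counts per (abstract state, abstract action)
        self.transitions = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
        self.rewards = defaultdict(float)  # running mean reward per (s, a)
        self.states = set()

    def update(self, s_abs, a_abs, reward, s_next_abs):
        """Record one abstract transition (in the paper, executed by a learned subpolicy)."""
        key = (s_abs, a_abs)
        self.counts[key] += 1
        n = self.counts[key]
        self.rewards[key] += (reward - self.rewards[key]) / n   # incremental mean
        self.transitions[key][s_next_abs] += 1
        self.states.update([s_abs, s_next_abs])

    def plan(self, iterations=100):
        """Value iteration over the abstract MDP; the 1/sqrt(n) bonus (an assumption
        here) favors rarely tried abstract actions, driving strategic exploration."""
        values = defaultdict(float)
        for _ in range(iterations):
            for s in self.states:
                q_values = []
                for (s_from, a), n in self.counts.items():
                    if s_from != s:
                        continue
                    expected_next = sum(
                        count / n * values[s_next]
                        for s_next, count in self.transitions[(s_from, a)].items()
                    )
                    q = (self.rewards[(s_from, a)]
                         + self.bonus / n ** 0.5
                         + self.gamma * expected_next)
                    q_values.append(q)
                values[s] = max(q_values, default=0.0)
        return values

# Example: grow the model from a few fictitious transitions, then plan.
mdp = AbstractMDP()
s0, s1 = phi((5, 5, 0)), phi((45, 5, 0))
mdp.update(s0, "go_right", reward=0.0, s_next_abs=s1)
mdp.update(s1, "go_right", reward=1.0, s_next_abs=s1)
print(mdp.plan()[s0])
```

Because the dynamics and exploration bookkeeping live entirely in the abstract MDP, swapping in a new reward signal only requires re-estimating the per-(state, action) rewards and re-planning, which is the intuition behind the fast reward transfer claimed in the abstract.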