Paper title
First return, then explore
Paper authors
Paper abstract
The promise of reinforcement learning is to solve complex sequential decision problems autonomously by specifying a high-level reward function only. However, reinforcement learning algorithms struggle when, as is often the case, simple and intuitive rewards provide sparse and deceptive feedback. Avoiding these pitfalls requires thoroughly exploring the environment, but creating algorithms that can do so remains one of the central challenges of the field. We hypothesise that the main impediment to effective exploration originates from algorithms forgetting how to reach previously visited states ("detachment") and from failing to first return to a state before exploring from it ("derailment"). We introduce Go-Explore, a family of algorithms that addresses these two challenges directly through the simple principles of explicitly remembering promising states and first returning to such states before intentionally exploring. Go-Explore solves all heretofore unsolved Atari games and surpasses the state of the art on all hard-exploration games, with orders of magnitude improvements on the grand challenges Montezuma's Revenge and Pitfall. We also demonstrate the practical potential of Go-Explore on a sparse-reward pick-and-place robotics task. Additionally, we show that adding a goal-conditioned policy can further improve Go-Explore's exploration efficiency and enable it to handle stochasticity throughout training. The substantial performance gains from Go-Explore suggest that the simple principles of remembering states, returning to them, and exploring from them are a powerful and general approach to exploration, an insight that may prove critical to the creation of truly intelligent learning agents.
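The abstract's core loop (remember promising states, return to them, then explore) can be illustrated with a minimal sketch. The code below is not the authors' implementation: the environment interface (`snapshot`/`restore` for returning via a saved simulator state), the `cell_of` state-conflation function, the uniform archive selection, and the random exploration policy are all simplifying assumptions made for illustration.

```python
import random

def go_explore(env, cell_of, iterations=1000, explore_steps=100):
    """Illustrative sketch of a Go-Explore-style loop (hypothetical interfaces).

    env      -- environment exposing reset/step plus snapshot()/restore() so the
                agent can "return" by restoring a saved simulator state.
    cell_of  -- maps an observation to a low-dimensional "cell" used as an
                archive key (e.g. a downscaled observation).
    """
    obs = env.reset()
    # Archive: for each cell, remember the best snapshot and score reaching it.
    archive = {cell_of(obs): {"snapshot": env.snapshot(), "score": 0.0}}

    for _ in range(iterations):
        # 1. Select a promising state from the archive (here: uniformly at random).
        cell = random.choice(list(archive))
        entry = archive[cell]

        # 2. Return: go back to that state before exploring from it.
        env.restore(entry["snapshot"])
        score = entry["score"]

        # 3. Explore: take exploratory actions from the restored state,
        #    adding any new or better-reached cells to the archive.
        for _ in range(explore_steps):
            obs, reward, done, _ = env.step(env.action_space.sample())
            score += reward
            c = cell_of(obs)
            if c not in archive or score > archive[c]["score"]:
                archive[c] = {"snapshot": env.snapshot(), "score": score}
            if done:
                break
    return archive
```

In the paper's full method this exploration phase is typically followed by robustifying the discovered trajectories into a policy (or, in the goal-conditioned variant, returning with a learned policy rather than a restored simulator state); the sketch above only covers the explore phase.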