Paper Title

Towards Improving Exploration in Self-Imitation Learning using Intrinsic Motivation

Paper Authors

Alain Andres, Esther Villar-Rodriguez, Javier Del Ser

Paper Abstract

Reinforcement Learning has emerged as a strong alternative to solve optimization tasks efficiently. The use of these algorithms highly depends on the feedback signals provided by the environment in charge of informing about how good (or bad) the decisions made by the learned agent are. Unfortunately, in a broad range of problems the design of a good reward function is not trivial, so in such cases sparse reward signals are instead adopted. The lack of a dense reward function poses new challenges, mostly related to exploration. Imitation Learning has addressed those problems by leveraging demonstrations from experts. In the absence of an expert (and its subsequent demonstrations), an option is to prioritize well-suited exploration experiences collected by the agent in order to bootstrap its learning process with good exploration behaviors. However, this solution highly depends on the ability of the agent to discover such trajectories in the early stages of its learning process. To tackle this issue, we propose to combine imitation learning with intrinsic motivation, two of the most widely adopted techniques to address problems with sparse reward. In this work intrinsic motivation is used to encourage the agent to explore the environment based on its curiosity, whereas imitation learning allows repeating the most promising experiences to accelerate the learning process. This combination is shown to yield an improved performance and better generalization in procedurally-generated environments, outperforming previously reported self-imitation learning methods and achieving equal or better sample efficiency with respect to intrinsic motivation in isolation.
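
The abstract describes two ingredients: a curiosity-based intrinsic reward that is added to the sparse extrinsic reward to drive exploration, and a self-imitation term that replays the agent's own high-return experiences. The sketch below (Python) is not the authors' implementation; it only illustrates one common way these pieces are realized. The function names and the reward coefficient `beta` are illustrative assumptions, and the self-imitation loss follows the standard formulation of Oh et al. (2018), on which self-imitation learning methods typically build.

```python
import numpy as np

def combined_reward(r_ext, r_int, beta=0.005):
    """Total reward fed to the policy update: the sparse extrinsic
    reward plus a curiosity bonus scaled by beta.
    beta = 0.005 is an illustrative value, not taken from the paper."""
    return r_ext + beta * r_int

def self_imitation_loss(returns, values, log_probs):
    """Self-imitation term in the style of Oh et al. (2018): imitate only
    transitions whose observed return exceeded the current value estimate
    (i.e., clipped positive advantage)."""
    advantage = np.maximum(returns - values, 0.0)
    policy_loss = -(log_probs * advantage).mean()
    value_loss = 0.5 * (advantage ** 2).mean()
    return policy_loss + value_loss

# Toy usage with made-up numbers.
print(combined_reward(0.0, 0.8))
print(self_imitation_loss(np.array([1.0, 0.0]),
                          np.array([0.4, 0.3]),
                          np.array([-0.2, -1.1])))
```

In this kind of setup, the intrinsic bonus keeps the agent exploring before any extrinsic reward is found, while the self-imitation loss is applied only to stored experiences with positive advantage, so the agent bootstraps from its own most promising trajectories rather than from expert demonstrations.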
