Paper Title
Learning Intrinsic Symbolic Rewards in Reinforcement Learning
Paper Authors
Paper Abstract
Learning effective policies for sparse objectives is a key challenge in Deep Reinforcement Learning (RL). A common approach is to design task-related dense rewards to improve task learnability. While such rewards are easily interpreted, they rely on heuristics and domain expertise. Alternate approaches that train neural networks to discover dense surrogate rewards avoid heuristics, but are high-dimensional, black-box solutions offering little interpretability. In this paper, we present a method that discovers dense rewards in the form of low-dimensional symbolic trees, thus making them more tractable for analysis. The trees use simple functional operators to map an agent's observations to a scalar reward, which then supervises the policy-gradient learning of a neural network policy. We test our method on continuous action spaces in MuJoCo and discrete action spaces in Atari and Pygame environments. We show that the discovered dense rewards are an effective signal for an RL policy to solve the benchmark tasks. Notably, we significantly outperform a widely used, contemporary neural-network-based reward-discovery algorithm in all environments considered.
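To make the idea of a symbolic reward tree concrete, here is a minimal sketch, assuming a small expression tree whose leaves read components of the observation vector and whose internal nodes apply simple functional operators to produce a scalar dense reward. The node classes (`Obs`, `Const`, `Op`), the operator set, and the example tree are hypothetical illustrations, not the paper's actual discovered rewards or search procedure.

```python
# Illustrative sketch (not the authors' implementation): a low-dimensional
# symbolic tree mapping an observation vector to a scalar dense reward.
import math
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Obs:
    """Leaf node: reads one component of the observation vector."""
    index: int

    def evaluate(self, observation: List[float]) -> float:
        return observation[self.index]


@dataclass
class Const:
    """Leaf node: a fixed constant."""
    value: float

    def evaluate(self, observation: List[float]) -> float:
        return self.value


@dataclass
class Op:
    """Internal node: applies a simple functional operator to its children."""
    fn: Callable[..., float]
    children: List["Node"]

    def evaluate(self, observation: List[float]) -> float:
        return self.fn(*(c.evaluate(observation) for c in self.children))


Node = Union[Obs, Const, Op]

# Hypothetical example tree: reward = tanh(obs[0] * obs[1]) - 0.1 * |obs[2]|,
# e.g. a shaping term that rewards forward progress while penalising effort.
reward_tree: Node = Op(
    lambda a, b: a - b,
    [
        Op(math.tanh, [Op(lambda x, y: x * y, [Obs(0), Obs(1)])]),
        Op(lambda x: 0.1 * abs(x), [Obs(2)]),
    ],
)

if __name__ == "__main__":
    observation = [0.5, 1.2, -0.3]  # toy 3-dimensional observation
    dense_reward = reward_tree.evaluate(observation)
    # The scalar would supervise policy-gradient updates of a neural policy.
    print(f"dense reward = {dense_reward:.4f}")
```

Because the tree is small and built from interpretable operators, the resulting reward can be read directly as a formula over named observation components, in contrast to a neural-network reward model.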