Paper Title
Inverse Reinforcement Learning from a Gradient-based Learner
Paper Authors
Paper Abstract
Inverse Reinforcement Learning addresses the problem of inferring an expert's reward function from demonstrations. However, in many applications, we not only have access to the expert's near-optimal behavior, but we also observe part of her learning process. In this paper, we propose a new algorithm for this setting, in which the goal is to recover the reward function being optimized by an agent, given a sequence of policies produced during learning. Our approach is based on the assumption that the observed agent is updating her policy parameters along the gradient direction. We then extend our method to the more realistic scenario in which we only have access to a dataset of learning trajectories. For both settings, we provide theoretical insights into our algorithms' performance. Finally, we evaluate the approach in a simulated GridWorld environment and in MuJoCo environments, comparing it with a state-of-the-art baseline.
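The gradient assumption stated in the abstract suggests a simple way to recover the reward when it is assumed linear in known features: each observed parameter update should match a gradient-ascent step, so the reward weights can be fit to the observed updates by least squares. The sketch below is a minimal illustration of that idea under strong simplifications introduced here, not the paper's exact algorithm: the function name `recover_reward_weights` is hypothetical, the learning rate is assumed known and constant, and the per-feature policy-gradient Jacobians are assumed given rather than estimated from trajectories.

```python
import numpy as np


def recover_reward_weights(thetas, feature_gradients, learning_rate):
    """Estimate reward weights omega from a sequence of policy parameters.

    Simplifying assumptions (not from the paper's exact formulation):
      * the reward is linear in known features, r(s, a) = omega^T phi(s, a),
        so the policy gradient factorizes as grad J(theta; omega) = G(theta) @ omega;
      * the observed agent performs plain gradient ascent with a known step size:
        theta_{t+1} = theta_t + learning_rate * G(theta_t) @ omega.
    """
    A_rows, b_rows = [], []
    for t in range(len(thetas) - 1):
        delta = thetas[t + 1] - thetas[t]      # observed parameter update
        G = feature_gradients[t]               # (dim_theta, dim_features) Jacobian at theta_t
        A_rows.append(learning_rate * G)
        b_rows.append(delta)
    A = np.vstack(A_rows)
    b = np.concatenate(b_rows)
    # Least-squares fit of the reward weights against all observed updates
    omega, *_ = np.linalg.lstsq(A, b, rcond=None)
    return omega


if __name__ == "__main__":
    # Synthetic check: generate updates from known weights and recover them.
    rng = np.random.default_rng(0)
    dim_theta, dim_feat, T, alpha = 5, 3, 20, 0.1
    true_omega = rng.normal(size=dim_feat)
    thetas, grads = [rng.normal(size=dim_theta)], []
    for _ in range(T):
        G = rng.normal(size=(dim_theta, dim_feat))  # stand-in for the policy-gradient Jacobian
        grads.append(G)
        thetas.append(thetas[-1] + alpha * G @ true_omega)
    print(recover_reward_weights(thetas, grads, alpha))  # should be close to true_omega
```

In the trajectory-only setting mentioned in the abstract, the Jacobians G(theta_t) would themselves have to be estimated from the observed learning trajectories, which is where the estimation error addressed by the paper's theoretical analysis comes in.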