Paper Title
Self-Imitation Advantage Learning
Paper Authors
Paper Abstract
Self-imitation learning is a Reinforcement Learning (RL) method that encourages actions whose returns were higher than expected, which helps in hard exploration and sparse reward problems. It was shown to improve the performance of on-policy actor-critic methods in several discrete control tasks. Nevertheless, applying self-imitation to the mostly action-value based off-policy RL methods is not straightforward. We propose SAIL, a novel generalization of self-imitation learning for off-policy RL, based on a modification of the Bellman optimality operator that we connect to Advantage Learning. Crucially, our method mitigates the problem of stale returns by choosing the most optimistic return estimate between the observed return and the current action-value for self-imitation. We demonstrate the empirical effectiveness of SAIL on the Arcade Learning Environment, with a focus on hard exploration games.
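The operator described in the abstract can be made concrete with a minimal sketch, reconstructed from the abstract's wording rather than the paper's exact formulation: let $\mathcal{T}^*$ denote the Bellman optimality operator, $Q$ the current action-value estimate, $R$ the return observed along a past trajectory, and $\alpha \in [0, 1)$ an advantage-learning coefficient. Advantage Learning augments the Bellman target with an action-gap term,

$$\mathcal{T}_{\mathrm{AL}} Q(s, a) = \mathcal{T}^* Q(s, a) + \alpha \big( Q(s, a) - \max_{a'} Q(s, a') \big),$$

and a self-imitation variant in the spirit described above replaces $Q(s, a)$ in that term with the more optimistic of the observed return and the current estimate,

$$\mathcal{T}_{\mathrm{SAIL}} Q(s, a) = \mathcal{T}^* Q(s, a) + \alpha \big( \max\!\big(R,\, Q(s, a)\big) - \max_{a'} Q(s, a') \big).$$

Taking the maximum with the current action-value is what mitigates stale returns: once $Q(s, a)$ overtakes an outdated return $R$, that return no longer influences the target.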