Title

Graph Backup: Data Efficient Backup Exploiting Markovian Transitions

Graph Backup: Data Efficient Backup Exploiting Markovian Transitions

Authors

Zhengyao Jiang, Tianjun Zhang, Robert Kirk, Tim Rocktäschel, Edward Grefenstette

Abstract

The successes of deep Reinforcement Learning (RL) are limited to settings where we have a large stream of online experiences, but applying RL in the data-efficient setting with limited access to online interactions is still challenging. A key to data-efficient RL is good value estimation, but current methods in this space fail to fully utilise the structure of the trajectory data gathered from the environment. In this paper, we treat the transition data of the MDP as a graph, and define a novel backup operator, Graph Backup, which exploits this graph structure for better value estimation. Compared to multi-step backup methods such as $n$-step $Q$-Learning and TD($\lambda$), Graph Backup can perform counterfactual credit assignment and gives stable value estimates for a state regardless of which trajectory the state is sampled from. Our method, when combined with popular value-based methods, provides improved performance over one-step and multi-step methods on a suite of data-efficient RL benchmarks including MiniGrid, MinAtar and Atari100K. We further analyse the reasons for this performance boost through a novel visualisation of the transition graphs of Atari games.
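To illustrate the core idea, here is a minimal sketch (not the authors' exact operator): transitions from all trajectories are pooled into one graph keyed by state, so a state visited by several trajectories merges their outgoing transitions, and a Bellman-style max backup over this graph propagates value across trajectory boundaries. The function names and the simple iterative scheme are assumptions for illustration only.

```python
from collections import defaultdict

def build_transition_graph(trajectories):
    """Pool (state, action, reward, next_state) tuples from all trajectories
    into one graph, merging transitions that share a state (Markov property)."""
    graph = defaultdict(list)  # state -> list of (action, reward, next_state)
    for traj in trajectories:
        for s, a, r, s_next in traj:
            graph[s].append((a, r, s_next))
    return graph

def graph_backup_values(graph, gamma=0.9, iters=50):
    """Toy value estimation over the pooled graph: repeatedly apply a
    max-over-actions backup. States absent from the graph (terminals)
    keep value 0. A simplification of the paper's Graph Backup operator."""
    V = defaultdict(float)
    for _ in range(iters):
        for s, transitions in graph.items():
            V[s] = max(r + gamma * V[s_next] for _, r, s_next in transitions)
    return V

# Two trajectories cross at shared state "x"; only one of them observes
# the reward after "x", but both predecessors ("s0", "s1") receive credit.
trajectories = [
    [("s0", "a", 0.0, "x"), ("x", "a", 0.0, "t0")],
    [("s1", "a", 0.0, "x"), ("x", "b", 1.0, "t1")],
]
V = graph_backup_values(build_transition_graph(trajectories))
```

Here `V["s1"]` and `V["s0"]` both converge to `0.9` even though the first trajectory never saw the reward, which is the counterfactual credit assignment that a purely trajectory-bound $n$-step backup cannot provide.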
