Paper Title

SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning

Paper Authors

Kimin Lee, Michael Laskin, Aravind Srinivas, Pieter Abbeel

Abstract

Off-policy deep reinforcement learning (RL) has been successful in a range of challenging domains. However, standard off-policy RL algorithms can suffer from several issues, such as instability in Q-learning and balancing exploration and exploitation. To mitigate these issues, we present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy RL algorithms. SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration. By enforcing the diversity between agents using Bootstrap with random initialization, we show that these different ideas are largely orthogonal and can be fruitfully integrated, together further improving the performance of existing off-policy RL algorithms, such as Soft Actor-Critic and Rainbow DQN, for both continuous and discrete control tasks on both low-dimensional and high-dimensional environments. Our training code is available at https://github.com/pokaxpoka/sunrise.
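To make the two ingredients concrete, below is a minimal NumPy sketch of how they can be read from the abstract: UCB-based action selection over a Q-ensemble, and an uncertainty-based weight for the Bellman backup. The discrete-action setting, the sigmoid-based weighting form, and the hyperparameters `lam` and `temp` are illustrative assumptions, not the paper's exact implementation; see the linked repository for that.

```python
import numpy as np

# Illustrative sketch of the two SUNRISE ingredients described in the abstract
# (discrete-action case for brevity). q_ensemble is assumed to be a list of
# callables mapping a state to a vector of per-action Q-values.

def ucb_action(q_ensemble, state, lam=1.0):
    """Select the action with the highest upper-confidence bound
    mean_i Q_i(s, a) + lam * std_i Q_i(s, a) over the ensemble."""
    q_values = np.stack([q(state) for q in q_ensemble])  # (n_ensemble, n_actions)
    return int(np.argmax(q_values.mean(axis=0) + lam * q_values.std(axis=0)))

def bellman_weight(q_target_ensemble, next_state, next_action, temp=10.0):
    """Down-weight Bellman backups whose target Q-value is uncertain:
    weight approaches 1 when the ensemble agrees (low std) and 0.5 when
    it disagrees (high std). The form sigmoid(-std * temp) + 0.5 is an
    assumed instantiation of 're-weighting by uncertainty'."""
    targets = np.array([q(next_state)[next_action] for q in q_target_ensemble])
    return 1.0 / (1.0 + np.exp(targets.std() * temp)) + 0.5
```

The weight returned by `bellman_weight` would multiply each transition's TD error in the critic loss, so updates driven by high-variance (uncertain) targets contribute less to training.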
