Paper Title
Qgraph-bounded Q-learning: Stabilizing Model-Free Off-Policy Deep Reinforcement Learning
Paper Authors
Paper Abstract
In state-of-the-art model-free off-policy deep reinforcement learning, a replay memory is used to store past experience and derive all network updates. Even if both state and action spaces are continuous, the replay memory only holds a finite number of transitions. We represent these transitions in a data graph and link its structure to soft divergence. By selecting a subgraph with a favorable structure, we construct a simplified Markov Decision Process for which exact Q-values can be computed efficiently as more data comes in. The subgraph and its associated Q-values can be represented as a QGraph. We show that the Q-value for each transition in the simplified MDP is a lower bound of the Q-value for the same transition in the original continuous Q-learning problem. By using these lower bounds in temporal difference learning, our method QG-DDPG is less prone to soft divergence and exhibits increased sample efficiency while being more robust to hyperparameters. QGraphs also retain information from transitions that have already been overwritten in the replay memory, which can decrease the algorithm's sensitivity to the replay memory capacity.
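Below is a minimal Python sketch of the mechanism the abstract describes: exact Q-values are computed on the finite graph of stored transitions, and those values are then used as lower bounds on temporal-difference targets. The names (`QGraph`, `lower_bounded_td_target`), the use of hashable state identifiers, and the fixed number of value-iteration sweeps are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: a finite deterministic MDP built from replay-memory transitions,
# whose exact Q-values serve as lower bounds for TD targets. States are assumed
# to be hashable identifiers (e.g., indices of stored states), which is an
# assumption made for this illustration.
from collections import defaultdict


class QGraph:
    """Finite graph of stored transitions with exact Q-values."""

    def __init__(self, gamma=0.99):
        self.gamma = gamma
        self.edges = {}                     # (s, a) -> (reward, next_state, done)
        self.out_edges = defaultdict(list)  # s -> list of (s, a) keys leaving s
        self.q = {}                         # (s, a) -> exact Q-value on the graph

    def add_transition(self, s, a, r, s_next, done):
        key = (s, a)
        self.edges[key] = (r, s_next, done)
        if key not in self.out_edges[s]:
            self.out_edges[s].append(key)

    def update_q_values(self, sweeps=50):
        """Value iteration restricted to the transitions stored in the graph."""
        for _ in range(sweeps):
            for key, (r, s_next, done) in self.edges.items():
                best_next = 0.0
                if not done and self.out_edges[s_next]:
                    best_next = max(self.q.get(k, 0.0)
                                    for k in self.out_edges[s_next])
                self.q[key] = r + self.gamma * best_next

    def lower_bound(self, s, a):
        """Exact graph Q-value if (s, a) is stored, else None."""
        return self.q.get((s, a))


def lower_bounded_td_target(r, q_next, done, gamma, bound):
    """Standard TD target, clipped from below by the QGraph value when available."""
    target = r + gamma * (0.0 if done else q_next)
    return target if bound is None else max(target, bound)
```

In this sketch the bound is only applied when a stored transition is found in the graph; for unknown state-action pairs `lower_bound` returns None and the standard TD target is used unchanged.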