Paper Title

Action Candidate Driven Clipped Double Q-learning for Discrete and Continuous Action Tasks

Authors

Haobo Jiang, Jin Xie, Jian Yang

Abstract

Double Q-learning is a popular reinforcement learning algorithm in Markov decision process (MDP) problems. Clipped Double Q-learning, as an effective variant of Double Q-learning, employs the clipped double estimator to approximate the maximum expected action value. Due to the underestimation bias of the clipped double estimator, the performance of clipped Double Q-learning may be degraded in some stochastic environments. In this paper, in order to reduce the underestimation bias, we propose an action candidate-based clipped double estimator for Double Q-learning. Specifically, we first select a set of elite action candidates with high action values from one set of estimators. Then, among these candidates, we choose the highest valued action from the other set of estimators. Finally, we use the maximum value in the second set of estimators to clip the action value of the chosen action in the first set of estimators and the clipped value is used for approximating the maximum expected action value. Theoretically, the underestimation bias in our clipped Double Q-learning decays monotonically as the number of action candidates decreases. Moreover, the number of action candidates controls the trade-off between the overestimation and underestimation biases. In addition, we also extend our clipped Double Q-learning to continuous action tasks via approximating the elite continuous action candidates. We empirically verify that our algorithm can more accurately estimate the maximum expected action value on some toy environments and yield good performance on several benchmark problems.
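The three-step estimator the abstract describes can be sketched for the tabular case as follows. This is a minimal illustrative sketch, not the authors' implementation; the function name and list-based interface are assumptions.

```python
def ac_clipped_double_estimate(q_a, q_b, k):
    """Action-candidate-based clipped double estimate of max_a E[Q(a)].

    q_a, q_b -- action-value lists from two independent estimators.
    k        -- number of elite action candidates; per the abstract, it
                controls the trade-off between overestimation (small k)
                and underestimation (large k) bias.
    (Illustrative sketch; name and interface are not from the paper.)
    """
    # Step 1: elite candidates = top-k actions under the first estimator.
    candidates = sorted(range(len(q_a)), key=lambda a: q_a[a])[-k:]
    # Step 2: among the candidates, pick the action the second
    # estimator values most highly.
    a_star = max(candidates, key=lambda a: q_b[a])
    # Step 3: clip the first estimator's value of that action by the
    # second estimator's maximum value.
    return min(q_a[a_star], max(q_b))
```

With k = 1 the candidate set collapses to the first estimator's greedy action, while k equal to the full action set recovers selection by the second estimator, which is how the candidate count interpolates between the two bias regimes.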
