Paper Title
An Analysis of Measure-Valued Derivatives for Policy Gradients
Paper Authors
Paper Abstract
Reinforcement learning methods for robotics are increasingly successful due to the constant development of better policy gradient techniques. Precise (low-variance) and accurate (low-bias) gradient estimators are crucial for tackling increasingly complex tasks. Traditional policy gradient algorithms use the likelihood-ratio trick, which is known to produce unbiased but high-variance estimates. More modern approaches exploit the reparametrization trick, which gives lower-variance gradient estimates but requires differentiable value function approximators. In this work, we study a different type of stochastic gradient estimator: the Measure-Valued Derivative. This estimator is unbiased, has low variance, and can be used with both differentiable and non-differentiable function approximators. We empirically evaluate this estimator in the actor-critic policy gradient setting and show that it can reach performance comparable to methods based on the likelihood-ratio or reparametrization tricks, in both low- and high-dimensional action spaces. With this work, we want to show that the Measure-Valued Derivative estimator can be a useful alternative to other policy gradient estimators.
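As a minimal illustration of the three estimator families named in the abstract (not the paper's implementation), the sketch below compares single-sample Monte Carlo estimates of the gradient of E_{x~N(mu, sigma^2)}[f(x)] with respect to the mean mu, for a hypothetical toy objective f(x) = x^2 whose true gradient is 2*mu. The Measure-Valued Derivative uses the known decomposition of the Gaussian-mean derivative into a positive part (mu plus a scaled Weibull/Rayleigh variable) and its mirrored negative part; the specific sample size and seed are illustrative choices.

```python
# A minimal sketch, assuming a 1-D Gaussian "policy" and toy objective f(x) = x^2.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.5, 1.0, 100_000

f = lambda x: x ** 2        # toy "return" function
df = lambda x: 2.0 * x      # its derivative (needed only by the reparametrization estimator)

x = rng.normal(mu, sigma, n)

# 1) Likelihood-ratio (score-function) estimator: f(x) * d/dmu log N(x; mu, sigma^2).
lr = np.mean(f(x) * (x - mu) / sigma ** 2)

# 2) Reparametrization estimator: x = mu + sigma * eps, differentiate through f.
eps = rng.standard_normal(n)
rp = np.mean(df(mu + sigma * eps))

# 3) Measure-Valued Derivative for the Gaussian mean: the derivative of the density
#    decomposes into a positive part mu + sigma * Weibull(sqrt(2), 2) (a Rayleigh(1)
#    variable) and its mirror image, with constant 1 / (sigma * sqrt(2 * pi)).
#    No derivative of f is required.
w = rng.rayleigh(scale=1.0, size=n)   # Rayleigh(1) == Weibull(scale=sqrt(2), shape=2)
mvd = (np.mean(f(mu + sigma * w)) - np.mean(f(mu - sigma * w))) / (sigma * np.sqrt(2 * np.pi))

print(f"true grad: {2 * mu:.3f}  LR: {lr:.3f}  reparam: {rp:.3f}  MVD: {mvd:.3f}")
```

All three estimates should agree with the analytical gradient 2*mu up to Monte Carlo noise; note that only the reparametrization estimator needs df, which mirrors the abstract's point that the Measure-Valued Derivative works with non-differentiable function approximators.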