Paper Title
DDPG++: Striving for Simplicity in Continuous-control Off-Policy Reinforcement Learning
Paper Authors
Paper Abstract
This paper prescribes a suite of techniques for off-policy Reinforcement Learning (RL) that simplify the training process and reduce the sample complexity. First, we show that the simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled. This is in contrast to existing literature, which develops sophisticated off-policy techniques. Second, we trace the training instabilities typical of off-policy algorithms to the greedy policy update step; existing solutions such as delayed policy updates do not mitigate this issue. Third, we show that ideas from the propensity estimation literature can be used to importance-sample transitions from the replay buffer and selectively update the policy to prevent deterioration of performance. We support these claims with extensive experiments on a set of challenging MuJoCo tasks. A short video of our results can be seen at https://tinyurl.com/scs6p5m .
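To make the abstract's first claim concrete, below is a minimal sketch of one common way to control overestimation bias in a deterministic-policy-gradient setup: a clipped double-Q (twin-critic) Bellman target. This is an illustrative assumption, not the paper's exact implementation; the names `q1_target`, `q2_target`, `target_policy`, and `gamma` are hypothetical placeholders.

```python
# Hedged sketch: clipped double-Q target for controlling overestimation bias.
# All module names below are illustrative assumptions, not the paper's code.
import torch

def td_target(reward, next_state, done, q1_target, q2_target, target_policy,
              gamma=0.99):
    """Bellman backup that bootstraps from the minimum of two target critics.

    Taking the element-wise minimum of two independent critic estimates is a
    standard device for suppressing the optimistic bias of the max/greedy
    backup used in off-policy actor-critic methods.
    """
    with torch.no_grad():
        next_action = target_policy(next_state)
        q_next = torch.min(q1_target(next_state, next_action),
                           q2_target(next_state, next_action))
        # (1 - done) zeroes out the bootstrap term at terminal transitions.
        return reward + gamma * (1.0 - done) * q_next
```

The critic is then regressed toward this target with a mean-squared error loss, while the actor is updated with the deterministic policy gradient through one of the critics.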