Paper Title

Reinforcement learning with experience replay and adaptation of action dispersion

Paper Authors

Paweł Wawrzyński, Wojciech Masarczyk, Mateusz Ostaszewski

Paper Abstract

Effective reinforcement learning requires a proper balance of exploration and exploitation, defined by the dispersion of the action distribution. However, this balance depends on the task, the current stage of the learning process, and the current environment state. Existing methods that designate the action distribution dispersion require problem-dependent hyperparameters. In this paper, we propose to automatically designate the action distribution dispersion using the following principle: this distribution should have sufficient dispersion to enable the evaluation of future policies. To that end, the dispersion should be tuned to assure a sufficiently high probability (density) of the actions in the replay buffer and the modes of the distributions that generated them, yet this dispersion should not be higher. This way, a policy can be effectively evaluated based on the actions in the buffer, but exploratory randomness in actions decreases when this policy converges. The above principle is verified here on the challenging benchmarks Ant, HalfCheetah, Hopper, and Walker2D, with good results. Our method makes the action standard deviations converge to values similar to those resulting from trial-and-error optimization.

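The abstract states the principle only at a high level. As a rough illustration (not the paper's actual algorithm), the sketch below shows one way such an adaptation could be implemented with NumPy: a state-independent log standard deviation is nudged so that the replayed actions, and the modes of the distributions that generated them, keep roughly a target log-density under the current policy and no more. The function `adapt_log_std`, the threshold `target_log_density`, the step size, and the replay-batch keys are all hypothetical assumptions made for this example.

```python
import numpy as np

def gaussian_log_density(x, mean, std):
    """Log-density of a diagonal Gaussian N(mean, diag(std^2)) evaluated at x."""
    var = std ** 2
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def adapt_log_std(log_std, policy_mean_fn, replay_batch,
                  target_log_density=-1.0, step_size=1e-3):
    """Nudge the (state-independent) log standard deviation so that buffered
    actions and the modes that generated them keep a sufficiently high density
    under the current policy, but not higher.

    Note: this proportional update rule and the batch keys 'states', 'actions',
    'old_modes' are assumptions for illustration, not taken from the paper.
    """
    std = np.exp(log_std)
    means = policy_mean_fn(replay_batch["states"])            # current policy modes
    log_p_actions = gaussian_log_density(replay_batch["actions"], means, std)
    log_p_old_modes = gaussian_log_density(replay_batch["old_modes"], means, std)
    # If replayed actions / old modes are too improbable under the current policy,
    # increase dispersion; if they are more probable than needed, shrink it.
    shortfall = target_log_density - 0.5 * (log_p_actions.mean() + log_p_old_modes.mean())
    return log_std + step_size * shortfall

# Toy usage with a linear mean function and random data (for illustration only).
rng = np.random.default_rng(0)
policy_mean_fn = lambda s: 0.1 * s
batch = {
    "states": rng.normal(size=(256, 4)),
    "actions": rng.normal(size=(256, 4)),
    "old_modes": rng.normal(size=(256, 4)),
}
log_std = np.zeros(4)
for _ in range(100):
    log_std = adapt_log_std(log_std, policy_mean_fn, batch)
print("adapted std:", np.exp(log_std))
```

In this toy loop the standard deviation settles at whatever level keeps the average log-density of the replayed actions and old modes near the target, which mirrors the abstract's idea that dispersion should shrink as the policy converges toward the actions stored in the buffer.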