Paper Title
Boosting Exploration in Actor-Critic Algorithms by Incentivizing Plausible Novel States
Paper Authors
Paper Abstract
Actor-critic (AC) algorithms are a class of model-free deep reinforcement learning algorithms that have proven their efficacy in diverse domains, especially in solving continuous control problems. Improving exploration (action entropy) and exploitation (expected return) through more efficient use of samples is a critical issue in AC algorithms. A basic strategy of a learning algorithm is to facilitate indiscriminate exploration of the entire environment state space, as well as to encourage exploring rarely visited states rather than frequently visited ones. Under this strategy, we propose a new method to boost exploration through an intrinsic reward, based on a measurement of a state's novelty and the associated benefit of exploring that state (with regard to policy optimization), together called plausible novelty. By incentivizing exploration of plausibly novel states, an AC algorithm is able to improve its sample efficiency and hence its training performance. The new method is verified through extensive simulations of continuous control tasks in MuJoCo environments on a variety of prominent off-policy AC algorithms.
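
The abstract states the idea only at a high level, so the following Python sketch is purely illustrative rather than the paper's actual formulation: it approximates a state's novelty with a count-based bonus over discretized states, and uses the magnitude of the critic's TD error as a hypothetical stand-in for the "benefit of exploring the state" with regard to policy optimization. All names and parameters here (PlausibleNoveltyBonus, beta, bin_size) are assumptions introduced for illustration.

    import numpy as np

    class PlausibleNoveltyBonus:
        """Illustrative intrinsic-reward sketch, not the paper's exact method.

        Novelty: count-based bonus over a coarse discretization of the state.
        Plausibility (assumed proxy): |TD error| from the critic, so novel
        states are rewarded only when exploring them appears useful for
        policy optimization.
        """

        def __init__(self, beta=0.1, bin_size=0.5):
            self.beta = beta          # scale of the intrinsic reward
            self.bin_size = bin_size  # coarseness of state discretization
            self.counts = {}          # visit counts per discretized state

        def _key(self, state):
            # Discretize a continuous state so it can be counted.
            return tuple(np.floor(np.asarray(state) / self.bin_size).astype(int))

        def novelty(self, state):
            # Count-based novelty: rarely visited states get a larger bonus.
            key = self._key(state)
            self.counts[key] = self.counts.get(key, 0) + 1
            return 1.0 / np.sqrt(self.counts[key])

        def intrinsic_reward(self, state, td_error):
            # Gate novelty by |TD error| (hypothetical plausibility proxy),
            # then add the result to the extrinsic reward in the update:
            #   r_total = r_extrinsic + bonus.intrinsic_reward(s_next, delta)
            return self.beta * self.novelty(state) * abs(td_error)

In an off-policy AC loop, such a bonus would typically be added to the environment reward for each transition drawn from the replay buffer before computing the critic targets; the multiplicative gating shown here is one simple way to suppress bonuses for novel but apparently unhelpful states.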