Paper Title
Population-Guided Parallel Policy Search for Reinforcement Learning
Paper Authors
Paper Abstract
In this paper, a new population-guided parallel learning scheme is proposed to enhance the performance of off-policy reinforcement learning (RL). In the proposed scheme, multiple identical learners, each with its own value function and policy, share a common experience replay buffer and search for a good policy collaboratively under the guidance of the best policy information. The key point is that the information of the best policy is fused in a soft manner by constructing an augmented loss function for the policy update, which enlarges the overall search region covered by the multiple learners. The guidance by the previous best policy and the enlarged search range enable faster and better policy search. Monotone improvement of the expected cumulative return by the proposed scheme is proved theoretically. Working algorithms are constructed by applying the proposed scheme to the twin delayed deep deterministic (TD3) policy gradient algorithm. Numerical results show that the constructed algorithm outperforms most current state-of-the-art RL algorithms, and that the gain is significant in sparse-reward environments.
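The soft fusion of the best policy's information can be illustrated with a small sketch. The abstract does not state the exact form of the augmented loss, so the code below assumes one plausible form: each non-best learner adds a squared-distance penalty between its own actions and the best learner's actions on the same batch of states, weighted by a coefficient `beta`. The function names `select_best_learner` and `augmented_policy_loss`, the parameter `beta`, and the use of mean recent return as the selection criterion are all hypothetical and chosen for illustration only.

```python
import numpy as np


def select_best_learner(recent_returns):
    """Return the index of the learner with the highest mean recent return.

    recent_returns: list of sequences, one per learner (assumed criterion).
    """
    mean_returns = [np.mean(r) for r in recent_returns]
    return int(np.argmax(mean_returns))


def augmented_policy_loss(base_loss, actions, best_actions, beta, is_best):
    """Base policy loss plus a soft guidance term toward the best policy.

    base_loss:    scalar policy loss of the underlying algorithm (e.g. TD3).
    actions:      (batch, action_dim) actions from this learner's policy.
    best_actions: (batch, action_dim) actions from the best learner's policy
                  evaluated on the same states.
    beta:         weight of the guidance term (assumed hyperparameter).
    is_best:      True if this learner is the current best; it gets no penalty.
    """
    if is_best:
        return float(base_loss)
    # Assumed distance measure: half the mean squared Euclidean distance.
    distance = 0.5 * np.mean(np.sum((actions - best_actions) ** 2, axis=1))
    return float(base_loss + beta * distance)


# Toy usage with three learners and a batch of four 2-D actions.
best_index = select_best_learner([[1.0, 2.0], [5.0, 6.0], [0.0, 0.0]])
own_actions = np.zeros((4, 2))
guide_actions = np.ones((4, 2))
loss_follower = augmented_policy_loss(1.0, own_actions, guide_actions,
                                      beta=0.5, is_best=False)
loss_best = augmented_policy_loss(1.0, own_actions, guide_actions,
                                  beta=0.5, is_best=True)
```

Under these assumptions, the penalty pulls each learner toward the best policy without forcing it to copy that policy, so the learners can still cover different parts of the search region. A larger `beta` tightens the guidance, while a smaller one leaves more room for independent search.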