Paper Title
Bootstrap Advantage Estimation for Policy Optimization in Reinforcement Learning
Paper Authors
Paper Abstract
This paper proposes an advantage estimation approach for policy optimization based on data augmentation. Unlike existing methods, which apply data augmentation to the input when learning the value and policy functions, our method uses data augmentation to compute a bootstrap advantage estimate. This Bootstrap Advantage Estimation (BAE) is then used to compute the gradients for updating the policy and value functions. To demonstrate the effectiveness of our approach, we conducted experiments on several environments from three benchmarks: Procgen, DeepMind Control, and PyBullet, which together cover image- and vector-based observations as well as discrete and continuous action spaces. We observe that our method reduces the policy and value losses more than the Generalized Advantage Estimation (GAE) method and ultimately improves the cumulative return. Furthermore, our method outperforms two recently proposed data augmentation techniques (RAD and DrAC). Overall, our method empirically outperforms the baselines in both sample efficiency and generalization, where the agent is tested in unseen environments.
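The abstract does not specify how the augmented data enters the advantage computation, so the following is only a minimal sketch of one plausible reading: compute a standard GAE over value estimates obtained from the original observations and from several augmented views, then average the resulting advantages. The names `value_fn`, `augment_fn`, and `n_aug` are hypothetical placeholders, not the paper's API, and the averaging scheme is an assumption rather than the authors' exact formulation.

```python
import numpy as np

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Standard Generalized Advantage Estimation over one trajectory.

    rewards, values, dones are arrays of length T; the value after the
    final step is treated as zero for simplicity.
    """
    advantages = np.zeros_like(rewards, dtype=np.float64)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        last_adv = delta + gamma * lam * nonterminal * last_adv
        advantages[t] = last_adv
    return advantages

def bootstrap_advantage(rewards, obs, dones, value_fn, augment_fn, n_aug=2):
    """Hypothetical BAE sketch (assumption, not the paper's definition):
    average GAE advantages computed from the original observations and
    from n_aug augmented views of the same trajectory."""
    # Advantages from the unaugmented observations.
    estimates = [gae(rewards, value_fn(obs), dones)]
    # Advantages from augmented copies (e.g. random crop or noise).
    for _ in range(n_aug):
        estimates.append(gae(rewards, value_fn(augment_fn(obs)), dones))
    return np.mean(estimates, axis=0)
```

Under this reading, the averaged estimate would then replace the plain GAE advantage in the policy-gradient and value-loss terms of an actor-critic update such as PPO.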