Paper Title
Variance Reduction based Partial Trajectory Reuse to Accelerate Policy Gradient Optimization
Paper Authors
Paper Abstract
Building on our previous study of green simulation assisted policy gradient (GS-PG), which focuses on trajectory-based reuse, in this paper we consider infinite-horizon Markov Decision Processes and create a new importance-sampling-based policy gradient optimization approach to support dynamic decision making. The existing GS-PG method was designed to learn from complete episodes or process trajectories, which limits its applicability to low-data situations and flexible online process control. To overcome this limitation, the proposed approach selectively reuses the most relevant partial trajectories, i.e., the reuse unit is per-step or per-decision historical observations. Specifically, we create a mixture likelihood ratio (MLR) based policy gradient optimization that can leverage the information from historical state-action transitions generated under different behavioral policies. The proposed variance reduction experience replay (VRER) approach can intelligently select and reuse the most relevant transition observations, improve the policy gradient estimation, and accelerate the learning of the optimal policy. Our empirical study demonstrates that it improves optimization convergence and enhances the performance of state-of-the-art policy optimization approaches such as the actor-critic method and proximal policy optimization.
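Since the abstract describes per-step (per-decision) reuse of historical transitions weighted by a mixture likelihood ratio over the behavior policies that generated them, the following is a minimal sketch of what such an MLR-weighted policy gradient estimate could look like. The linear-softmax policy, the equal-weight mixture over behavior policies, and all function names are illustrative assumptions for exposition, not the authors' implementation (which additionally includes the VRER selection rule).

```python
# Minimal sketch (not the paper's code): per-step mixture likelihood ratio (MLR)
# weighting for reusing historical transitions in a policy gradient estimate.
# Assumes a discrete-action linear-softmax policy; all names are illustrative.
import numpy as np

def softmax_policy_probs(theta, state):
    """Action probabilities of a linear-softmax policy pi_theta(.|state)."""
    logits = theta @ state            # theta: (n_actions, n_features), state: (n_features,)
    logits -= logits.max()            # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def mlr_weights(theta_target, behavior_thetas, states, actions):
    """Mixture likelihood ratio w_i = pi_target(a_i|s_i) / mean_k pi_k(a_i|s_i)."""
    target_p = np.array([softmax_policy_probs(theta_target, s)[a]
                         for s, a in zip(states, actions)])
    mixture_p = np.zeros(len(states))
    for theta_k in behavior_thetas:
        mixture_p += np.array([softmax_policy_probs(theta_k, s)[a]
                               for s, a in zip(states, actions)])
    mixture_p /= len(behavior_thetas)
    return target_p / mixture_p

def mlr_policy_gradient(theta_target, behavior_thetas, states, actions, advantages):
    """Reweighted per-step gradient: (1/N) sum_i w_i * A_i * grad log pi(a_i|s_i)."""
    w = mlr_weights(theta_target, behavior_thetas, states, actions)
    grad = np.zeros_like(theta_target)
    for w_i, s, a, adv in zip(w, states, actions, advantages):
        p = softmax_policy_probs(theta_target, s)
        one_hot = np.zeros(len(p))
        one_hot[a] = 1.0
        # grad of log pi(a|s) for a linear-softmax policy: (e_a - pi) outer s
        grad += w_i * adv * np.outer(one_hot - p, s)
    return grad / len(states)
```

Because the mixture of behavior policies appears in the denominator, a single transition that is unlikely under one old policy but likely under another does not blow up the weight, which is the variance-reduction intuition behind reusing per-step observations rather than full trajectories.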