Paper Title
The Sufficiency of Off-Policyness and Soft Clipping: PPO is still Insufficient according to an Off-Policy Measure
Authors
Abstract
The popular Proximal Policy Optimization (PPO) algorithm approximates the solution in a clipped policy space. Do better policies exist outside of this space? By using a novel surrogate objective that employs the sigmoid function (which provides an interesting way of exploration), we find that the answer is ``YES'', and that the better policies are in fact located very far from the clipped space. We show that PPO is insufficient in ``off-policyness'', according to an off-policy metric called DEON. Our algorithm explores in a much larger policy space than PPO, and it maximizes the Conservative Policy Iteration (CPI) objective better than PPO does during training. To the best of our knowledge, all current PPO methods use the clipping operation and optimize in the clipped policy space. Our method is the first of its kind to go beyond that space, and it advances the understanding of CPI optimization and policy gradient methods. Code is available at https://github.com/raincchio/P3O.
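For concreteness, the sketch below contrasts PPO's hard-clipped surrogate with a sigmoid-based soft clip of the general kind the abstract describes. This is a minimal illustration, not the authors' exact P3O objective (which is defined in the paper and at https://github.com/raincchio/P3O); the `eps` and `sharpness` parameters and the particular sigmoid weighting are illustrative assumptions.

```python
import torch

def ppo_clip_loss(ratio, adv, eps=0.2):
    """PPO's hard-clipped surrogate: the gradient vanishes once the
    probability ratio leaves [1 - eps, 1 + eps] in the adverse direction."""
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -torch.min(unclipped, clipped).mean()

def soft_clip_loss(ratio, adv, eps=0.2, sharpness=4.0):
    """Hypothetical sigmoid-based soft clip (illustrative only, NOT the
    exact P3O objective): a smooth approximation of
    clamp(ratio, 1 - eps, 1 + eps) that saturates gradually, so the
    gradient never vanishes entirely and the policy can move farther
    from the behavior policy, i.e. become more off-policy."""
    soft_ratio = 1.0 + 2.0 * eps * (torch.sigmoid(sharpness * (ratio - 1.0)) - 0.5)
    return -(soft_ratio * adv).mean()

# Toy usage: at ratio = 3.0 with a positive advantage, the hard clip
# yields a zero gradient, while the soft clip still provides one.
ratio = torch.tensor([0.5, 1.0, 1.5, 3.0], requires_grad=True)
adv = torch.tensor([1.0, -0.5, 2.0, 1.0])

ppo_clip_loss(ratio, adv).backward()
print("hard-clip grads:", ratio.grad)

ratio.grad = None
soft_clip_loss(ratio, adv).backward()
print("soft-clip grads:", ratio.grad)
```

The design point the abstract hinges on is visible here: the hard clip cuts the optimization off at the boundary of the clipped policy space, while a sigmoid-shaped surrogate keeps a (shrinking) gradient signal for arbitrarily large ratios, allowing exploration of policies far outside that space.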