Paper Title

Projection-Based Constrained Policy Optimization

Authors

Tsung-Yen Yang, Justinian Rosca, Karthik Narasimhan, Peter J. Ramadge

Abstract

We consider the problem of learning control policies that optimize a reward function while satisfying constraints due to considerations of safety, fairness, or other costs. We propose a new algorithm, Projection-Based Constrained Policy Optimization (PCPO). This is an iterative method for optimizing policies in a two-step process: the first step performs a local reward improvement update, while the second step reconciles any constraint violation by projecting the policy back onto the constraint set. We theoretically analyze PCPO and provide a lower bound on reward improvement, and an upper bound on constraint violation, for each policy update. We further characterize the convergence of PCPO based on two different metrics: $\ell_2$ norm and Kullback-Leibler divergence. Our empirical results over several control tasks demonstrate that PCPO achieves superior performance, averaging more than 3.5 times less constraint violation and around 15% higher reward compared to state-of-the-art methods.
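To make the two-step update concrete, below is a minimal NumPy sketch of one PCPO-style iteration, assuming the reward objective and cost constraint are linearized around the current policy parameters, as in the trust-region updates the paper builds on. The variable names (`theta`, `g`, `a`, `b`, `H`, `delta`) and the toy numbers are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def pcpo_update(theta, g, a, b, H, delta):
    """One PCPO-style iteration on a linearized objective and constraint.

    theta : current policy parameters
    g     : gradient of the reward objective at theta
    a     : gradient of the cost constraint at theta
    b     : current constraint violation (cost minus its limit)
    H     : metric matrix (Fisher information for the KL projection,
            identity for the L2 projection)
    delta : trust-region step size
    """
    H_inv = np.linalg.inv(H)

    # Step 1: local reward improvement (trust-region / natural-gradient step).
    step = np.sqrt(2.0 * delta / (g @ H_inv @ g)) * (H_inv @ g)
    theta_half = theta + step

    # Step 2: project back onto the constraint set. If the intermediate
    # policy already satisfies the linearized constraint, the scale is 0
    # and the projection leaves it unchanged.
    violation = a @ (theta_half - theta) + b
    scale = max(0.0, violation / (a @ H_inv @ a))
    return theta_half - scale * (H_inv @ a)

# Toy usage with 2-D parameters (illustrative numbers only).
theta = np.zeros(2)
g = np.array([1.0, 0.5])   # reward gradient
a = np.array([0.8, -0.2])  # cost-constraint gradient
b = 0.1                    # current constraint violation
H = np.eye(2)              # identity gives the L2 projection; use the
                           # Fisher matrix for the KL projection
print(pcpo_update(theta, g, a, b, H, delta=0.01))
```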
