Paper Title

Projection-Based Constrained Policy Optimization

Authors

Tsung-Yen Yang, Justinian Rosca, Karthik Narasimhan, Peter J. Ramadge

Abstract

We consider the problem of learning control policies that optimize a reward function while satisfying constraints due to considerations of safety, fairness, or other costs. We propose a new algorithm, Projection-Based Constrained Policy Optimization (PCPO). This is an iterative method for optimizing policies in a two-step process: the first step performs a local reward improvement update, while the second step reconciles any constraint violation by projecting the policy back onto the constraint set. We theoretically analyze PCPO and provide a lower bound on reward improvement, and an upper bound on constraint violation, for each policy update. We further characterize the convergence of PCPO based on two different metrics: $\ell_2$ norm and Kullback-Leibler divergence. Our empirical results over several control tasks demonstrate that PCPO achieves superior performance, averaging more than 3.5 times less constraint violation and around 15% higher reward compared to state-of-the-art methods.
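To make the two-step update concrete, below is a minimal NumPy sketch of one PCPO-style iteration, assuming the reward objective and cost constraint are linearized around the current policy parameters, as in the trust-region updates the paper builds on. The variable names (`theta`, `g`, `a`, `b`, `H`, `delta`) and the toy numbers are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def pcpo_update(theta, g, a, b, H, delta):
    """One PCPO-style iteration on a linearized objective and constraint.

    theta : current policy parameters
    g     : gradient of the reward objective at theta
    a     : gradient of the cost constraint at theta
    b     : current constraint violation (cost minus its limit)
    H     : metric matrix (Fisher information for the KL projection,
            identity for the L2 projection)
    delta : trust-region step size
    """
    H_inv = np.linalg.inv(H)

    # Step 1: local reward improvement (trust-region / natural-gradient step).
    step = np.sqrt(2.0 * delta / (g @ H_inv @ g)) * (H_inv @ g)
    theta_half = theta + step

    # Step 2: project back onto the constraint set. If the intermediate
    # policy already satisfies the linearized constraint, the scale is 0
    # and the projection leaves it unchanged.
    violation = a @ (theta_half - theta) + b
    scale = max(0.0, violation / (a @ H_inv @ a))
    return theta_half - scale * (H_inv @ a)

# Toy usage with 2-D parameters (illustrative numbers only).
theta = np.zeros(2)
g = np.array([1.0, 0.5])   # reward gradient
a = np.array([0.8, -0.2])  # cost-constraint gradient
b = 0.1                    # current constraint violation
H = np.eye(2)              # identity gives the L2 projection; use the
                           # Fisher matrix for the KL projection
print(pcpo_update(theta, g, a, b, H, delta=0.01))
```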
