Paper Title
Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Conservative Natural Policy Gradient Primal-Dual Algorithm
Paper Authors
Paper Abstract
We consider the problem of constrained Markov decision processes (CMDPs) in continuous state-action spaces, where the goal is to maximize the expected cumulative reward subject to constraints. We propose a novel Conservative Natural Policy Gradient Primal-Dual Algorithm (C-NPG-PD) that achieves zero constraint violation while attaining state-of-the-art convergence guarantees for the objective value function. For general policy parametrization, we prove convergence of the value function to the global optimum, up to an approximation error due to the restricted policy class. We further improve the sample complexity of the existing constrained NPG-PD algorithm \cite{Ding2020} from $\mathcal{O}(1/\epsilon^6)$ to $\mathcal{O}(1/\epsilon^4)$. To the best of our knowledge, this is the first work to establish zero constraint violation with natural policy gradient-style algorithms for infinite-horizon discounted CMDPs. We demonstrate the merits of the proposed algorithm via experimental evaluations.
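To make the primal-dual structure concrete, below is a minimal, hedged sketch of a conservative natural policy gradient primal-dual loop on a toy tabular CMDP. Everything here is an illustrative assumption rather than the authors' exact method: the toy dynamics, the cost-constraint form $J_g \le b$, the hyperparameters, the conservative margin `kappa`, and the use of exact planning in place of sample-based gradient estimates.

```python
# Hedged sketch of a C-NPG-PD style loop on a toy tabular CMDP.
# All quantities below (dynamics, margin kappa, step sizes) are illustrative
# assumptions, not the paper's exact specification.
import numpy as np

rng = np.random.default_rng(0)

# Toy CMDP: S states, A actions, random dynamics, reward r, cost g, budget b.
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = next-state distribution
r = rng.random((S, A))                       # reward in [0, 1]
g = rng.random((S, A))                       # constraint cost in [0, 1]
b = 0.5 / (1 - gamma)                        # assumed constraint budget: J_g <= b
rho = np.ones(S) / S                         # initial state distribution

def value(theta, c):
    """Exact discounted value of the softmax policy theta for per-step signal c."""
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    P_pi = np.einsum('sap,sa->sp', P, pi)    # state transitions under pi
    c_pi = (pi * c).sum(axis=1)              # expected per-step signal under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)
    return rho @ v, pi

def npg_direction(pi, c):
    """Natural-gradient direction for softmax parametrization (assumed to be the
    advantage function, absorbing constants into the step size)."""
    P_pi = np.einsum('sap,sa->sp', P, pi)
    c_pi = (pi * c).sum(axis=1)
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)
    q = c + gamma * P @ v                    # Q-values
    return q - v[:, None]                    # advantage

# Primal NPG ascent on the Lagrangian, projected dual ascent on the
# *conservatively tightened* constraint (budget shrunk by kappa > 0).
theta, lam = np.zeros((S, A)), 0.0
eta_theta, eta_lam, kappa, T = 0.1, 0.05, 0.05 / (1 - gamma), 2000
for t in range(T):
    J_r, pi = value(theta, r)
    J_g, _ = value(theta, g)
    # Primal step: natural gradient of the Lagrangian objective r - lam * g.
    theta += eta_theta * npg_direction(pi, r - lam * g)
    # Dual step: enforce J_g <= b - kappa so the true constraint holds with slack.
    lam = max(0.0, lam + eta_lam * (J_g - (b - kappa)))

print(f"reward value {J_r:.3f}, cost value {J_g:.3f}, budget {b:.3f}")
```

In this sketch the conservative margin `kappa` tightens the constraint budget during training; the intuition, consistent with the abstract's claim, is that solving the tightened problem leaves enough slack to absorb optimization and estimation errors, so the original constraint is never violated.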