Paper Title

Provably Efficient Safe Exploration via Primal-Dual Policy Optimization

Paper Authors

Dongsheng Ding, Xiaohan Wei, Zhuoran Yang, Zhaoran Wang, Mihailo R. Jovanović

Paper Abstract

We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation in which an agent aims to maximize the expected total reward subject to a safety constraint on the expected total value of a utility function. We focus on an episodic setting with the function approximation where the Markov transition kernels have a linear structure but do not impose any additional assumptions on the sampling model. Designing SRL algorithms with provable computational and statistical efficiency is particularly challenging under this setting because of the need to incorporate both the safety constraint and the function approximation into the fundamental exploitation/exploration tradeoff. To this end, we present an \underline{O}ptimistic \underline{P}rimal-\underline{D}ual Proximal Policy \underline{OP}timization (OPDOP) algorithm where the value function is estimated by combining the least-squares policy evaluation and an additional bonus term for safe exploration. We prove that the proposed algorithm achieves an $\tilde{O}(d H^{2.5}\sqrt{T})$ regret and an $\tilde{O}(d H^{2.5}\sqrt{T})$ constraint violation, where $d$ is the dimension of the feature mapping, $H$ is the horizon of each episode, and $T$ is the total number of steps. These bounds hold when the reward/utility functions are fixed but the feedback after each episode is bandit. Our bounds depend on the capacity of the state-action space only through the dimension of the feature mapping and thus our results hold even when the number of states goes to infinity. To the best of our knowledge, we provide the first provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
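To make the three ingredients described in the abstract concrete, the sketch below illustrates, on toy data, (i) least-squares policy evaluation over a feature map with a UCB-style bonus term for optimism, (ii) a proximal (softmax / mirror-descent) policy update on the combined reward-plus-utility estimate, and (iii) a projected dual update of the Lagrange multiplier that enforces the safety constraint. This is a minimal illustration under assumed names and hyperparameters (`phi`, `beta`, `eta`, `alpha`, the threshold `b`, and the random toy data), not the authors' OPDOP implementation.

```python
# Minimal sketch of an optimistic primal-dual update in a linear CMDP setting.
# All numerical choices below are assumptions made for illustration only.
import numpy as np

rng = np.random.default_rng(0)

d, n_actions = 4, 3      # feature dimension and a finite toy action set (assumed)
beta = 1.0               # exploration bonus coefficient (assumed)
eta, alpha = 0.5, 0.1    # policy and dual step sizes (assumed)
xi = 2.0                 # cap on the Lagrange multiplier (assumed)
H = 1.0                  # upper bound used to truncate value estimates (assumed)

def ls_policy_evaluation(Phi, targets, lam=1.0):
    """Ridge regression of Bellman targets onto features; returns the weight
    vector and the covariance matrix used for the exploration bonus."""
    Lambda = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
    w = np.linalg.solve(Lambda, Phi.T @ targets)
    return w, Lambda

def optimistic_q(phi_sa, w, Lambda, cap=H):
    """Point estimate plus a UCB-style bonus, truncated to [0, cap]."""
    bonus = beta * np.sqrt(phi_sa @ np.linalg.solve(Lambda, phi_sa))
    return float(np.clip(phi_sa @ w + bonus, 0.0, cap))

# Toy data standing in for transitions collected in one episode.
Phi = rng.normal(size=(20, d))            # features phi(s_t, a_t)
targets_r = rng.uniform(0, 1, size=20)    # reward Bellman targets (toy)
targets_u = rng.uniform(0, 1, size=20)    # utility Bellman targets (toy)

w_r, Lam_r = ls_policy_evaluation(Phi, targets_r)
w_u, Lam_u = ls_policy_evaluation(Phi, targets_u)

# Optimistic Q estimates at one state, for every action.
phi_actions = rng.normal(size=(n_actions, d))
q_r = np.array([optimistic_q(p, w_r, Lam_r) for p in phi_actions])
q_u = np.array([optimistic_q(p, w_u, Lam_u) for p in phi_actions])

# Proximal (mirror-descent) policy update on the composite estimate Q_r + lambda * Q_u.
lam_dual = 0.5
policy = np.ones(n_actions) / n_actions
policy *= np.exp(eta * (q_r + lam_dual * q_u))
policy /= policy.sum()

# Dual update: move the multiplier against the estimated constraint slack V_u - b,
# projecting back to [0, xi].
b = 0.6                                   # safety threshold (assumed)
v_u = float(policy @ q_u)                 # estimated utility value of the new policy
lam_dual = float(np.clip(lam_dual - alpha * (v_u - b), 0.0, xi))

print("policy:", np.round(policy, 3), "lambda:", round(lam_dual, 3))
```

In this sketch the multiplier grows when the estimated utility value falls below the threshold, pushing the next policy update toward constraint satisfaction, which mirrors the primal-dual mechanism the abstract refers to; the actual algorithm, bonus construction, and step-size choices are specified in the paper.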
