Paper Title

Towards Safe Reinforcement Learning via Constraining Conditional Value-at-Risk

Paper Authors

Chengyang Ying, Xinning Zhou, Hang Su, Dong Yan, Ning Chen, Jun Zhu

Paper Abstract

Though deep reinforcement learning (DRL) has obtained substantial success, it may encounter catastrophic failures due to the intrinsic uncertainty of both transition and observation. Most of the existing methods for safe reinforcement learning can only handle transition disturbance or observation disturbance since these two kinds of disturbance affect different parts of the agent; besides, the popular worst-case return may lead to overly pessimistic policies. To address these issues, we first theoretically prove that the performance degradation under transition disturbance and observation disturbance depends on a novel metric of Value Function Range (VFR), which corresponds to the gap in the value function between the best state and the worst state. Based on the analysis, we adopt conditional value-at-risk (CVaR) as an assessment of risk and propose a novel reinforcement learning algorithm of CVaR-Proximal-Policy-Optimization (CPPO) which formalizes the risk-sensitive constrained optimization problem by keeping its CVaR under a given threshold. Experimental results show that CPPO achieves a higher cumulative reward and is more robust against both observation and transition disturbances on a series of continuous control tasks in MuJoCo.
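
To make the risk measure concrete, the following is a minimal Python sketch of the empirical CVaR of a sampled cost distribution and of a Lagrangian-style risk-constrained objective. The function names, the alpha level, and the Lagrangian form are illustrative assumptions for exposition only; this is not the paper's CPPO implementation.

```python
import numpy as np

def cvar(costs, alpha=0.1):
    """Empirical CVaR_alpha of a cost sample: the mean of the worst
    (highest-cost) alpha-fraction of outcomes. Names and alpha are
    illustrative, not taken from the paper."""
    costs = np.asarray(costs, dtype=float)
    var_alpha = np.quantile(costs, 1.0 - alpha)   # value-at-risk: the (1 - alpha)-quantile
    tail = costs[costs >= var_alpha]              # the alpha-tail of worst outcomes
    return tail.mean()

def risk_constrained_objective(returns, costs, lam, threshold, alpha=0.1):
    """Lagrangian sketch of 'maximize expected return subject to
    CVaR_alpha(cost) <= threshold', with multiplier lam (hypothetical)."""
    return np.mean(returns) - lam * (cvar(costs, alpha) - threshold)

# Example usage with simulated rollout statistics.
rng = np.random.default_rng(0)
returns = rng.normal(loc=100.0, scale=10.0, size=1000)
costs = rng.exponential(scale=5.0, size=1000)
print(cvar(costs, alpha=0.1))
print(risk_constrained_objective(returns, costs, lam=1.0, threshold=10.0))
```

In a CPPO-style method, such a constraint would be estimated from policy rollouts and enforced during policy optimization; the sketch above only illustrates the risk measure and the constrained objective, not the policy-update step.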
