Paper Title

SAAC: Safe Reinforcement Learning as an Adversarial Game of Actor-Critics

Authors

Yannis Flet-Berliac, Debabrota Basu

Abstract

Although Reinforcement Learning (RL) is effective for sequential decision-making problems under uncertainty, it still fails to thrive in real-world systems where risk or safety is a binding constraint. In this paper, we formulate the RL problem with safety constraints as a non-zero-sum game. While deployed with maximum entropy RL, this formulation leads to a safe adversarially guided soft actor-critic framework, called SAAC. In SAAC, the adversary aims to break the safety constraint while the RL agent aims to maximize the constrained value function given the adversary's policy. The safety constraint on the agent's value function manifests only as a repulsion term between the agent's and the adversary's policies. Unlike previous approaches, SAAC can address different safety criteria such as safe exploration, mean-variance risk sensitivity, and CVaR-like coherent risk sensitivity. We illustrate the design of the adversary for these constraints. Then, in each of these variations, we show the agent differentiates itself from the adversary's unsafe actions in addition to learning to solve the task. Finally, for challenging continuous control tasks, we demonstrate that SAAC achieves faster convergence, better efficiency, and fewer failures to satisfy the safety constraints than risk-averse distributional RL and risk-neutral soft actor-critic algorithms.
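
For intuition, the following is a minimal sketch of how the repulsion term described in the abstract might enter the agent's actor update. The use of a KL divergence as the repulsion measure, the coefficient `beta`, and the function and variable names are illustrative assumptions, not the paper's exact loss.

```python
# Hypothetical sketch of a SAAC-style actor loss: the standard soft
# actor-critic objective plus a repulsion term that pushes the agent's
# policy away from the adversary's. `beta` and the KL-based repulsion
# are assumptions for illustration, not the paper's exact formulation.
import torch
from torch.distributions import Normal
from torch.distributions.kl import kl_divergence


def saac_actor_loss(agent_dist: Normal, adversary_dist: Normal,
                    critic, state: torch.Tensor,
                    alpha: float = 0.2, beta: float = 1.0) -> torch.Tensor:
    """Agent's loss: maximize the soft Q-value, the policy entropy, and
    the divergence between the agent's and the adversary's policies."""
    action = agent_dist.rsample()                   # reparameterized action sample
    log_prob = agent_dist.log_prob(action).sum(-1)  # SAC entropy term
    q_value = critic(state, action)                 # constrained value estimate
    repulsion = kl_divergence(agent_dist, adversary_dist).sum(-1)
    # Minimizing this loss maximizes Q and the distance from unsafe behavior.
    return (alpha * log_prob - q_value - beta * repulsion).mean()
```

Because the game is non-zero-sum, the adversary would not simply minimize the negation of this loss: per the abstract, it is trained on its own objective of breaking the chosen safety criterion (safe exploration, mean-variance, or CVaR-like risk sensitivity), while the agent's safety constraint surfaces only through the repulsion term above.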
