Paper Title
Provable Safe Reinforcement Learning with Binary Feedback
Paper Authors
Paper Abstract
Safety is a crucial necessity in many applications of reinforcement learning (RL), whether robotic, automotive, or medical. Many existing approaches to safe RL rely on receiving numeric safety feedback, but in many cases this feedback can only take binary values; that is, whether an action in a given state is safe or unsafe. This is particularly true when feedback comes from human experts. We therefore consider the problem of provably safe RL when given access to an offline oracle providing binary feedback on the safety of state-action pairs. We provide a novel meta-algorithm, SABRE, which can be applied to any MDP setting given access to a black-box PAC RL algorithm for that setting. SABRE applies concepts from active learning to reinforcement learning to provably control the number of queries to the safety oracle. SABRE works by iteratively exploring the state space to find regions where the agent is currently uncertain about safety. Our main theoretical result shows that, under appropriate technical assumptions, SABRE never takes unsafe actions during training, and is guaranteed to return a near-optimal safe policy with high probability. We provide a discussion of how our meta-algorithm may be applied to various settings studied in both theoretical and empirical frameworks.
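
To make the abstract's description concrete, below is a minimal Python sketch of a SABRE-style loop. All names (explore_uncertain, safety_oracle, pac_rl_solver) are illustrative assumptions rather than the authors' implementation; the sketch only illustrates the active-learning idea of querying the binary safety oracle exclusively on state-action pairs whose safety is still uncertain, then handing the certified-safe region to a black-box PAC RL algorithm.

# A minimal sketch of a SABRE-style training loop, assuming hypothetical helper
# functions; this is an illustration of the active-learning idea described in
# the abstract, not the authors' reference implementation.
from typing import Callable, Set, Tuple

State = int       # placeholder state type for the sketch
Action = int      # placeholder action type for the sketch
Pair = Tuple[State, Action]

def sabre_sketch(
    explore_uncertain: Callable[[Set[Pair]], Set[Pair]],   # reachable pairs with uncertain safety
    safety_oracle: Callable[[State, Action], bool],         # binary feedback: True iff (s, a) is safe
    pac_rl_solver: Callable[[Set[Pair]], object],           # black-box PAC RL algorithm on the safe region
    num_rounds: int = 10,
):
    """Iteratively explore within the known-safe region, query the offline
    oracle only on state-action pairs whose safety is still uncertain, and
    finally run the black-box PAC RL algorithm restricted to safe actions."""
    known_safe: Set[Pair] = set()
    known_unsafe: Set[Pair] = set()

    for _ in range(num_rounds):
        # Exploration phase: surface pairs whose safety label cannot yet be
        # inferred from previously queried labels.
        uncertain = explore_uncertain(known_safe)
        if not uncertain:
            break  # safety knowledge has converged; no further oracle queries needed
        # Active-learning phase: spend oracle queries only on uncertain pairs,
        # which is what bounds the total number of oracle queries.
        for s, a in uncertain:
            if safety_oracle(s, a):
                known_safe.add((s, a))
            else:
                known_unsafe.add((s, a))

    # Optimization phase: hand the certified-safe region to the black-box
    # PAC RL algorithm to obtain a near-optimal safe policy.
    return pac_rl_solver(known_safe)

In this reading, the query budget is controlled because the oracle is only ever consulted on pairs the agent cannot yet label itself, while safety during training follows from restricting exploration to the currently known-safe region.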