Paper Title
Learning a Shield from Catastrophic Action Effects: Never Repeat the Same Mistake
Paper Authors
Paper Abstract
Agents that operate in an unknown environment are bound to make mistakes while learning, including, at least occasionally, some that lead to catastrophic consequences. When humans make catastrophic mistakes, they are expected to learn never to repeat them, such as a toddler who touches a hot stove and immediately learns never to do so again. In this work we consider a novel class of POMDPs, called POMDPs with Catastrophic Actions (POMDP-CA), in which pairs of states and actions are labeled as catastrophic. Agents that act in a POMDP-CA have no a priori knowledge about which (state, action) pairs are catastrophic, so they are bound to make mistakes when trying to learn any meaningful policy. Rather, their aim is to maximize reward while never repeating mistakes. As a first step toward avoiding repeated mistakes, we leverage the concept of a shield, which prevents agents from executing specific actions from specific states. In particular, we store the catastrophic mistakes (unsafe pairs of states and actions) that agents make in a database. Agents are then forbidden from picking actions that appear in the database. This approach is especially useful in a continual learning setting, where groups of agents perform a variety of tasks over time in the same underlying environment. In this setting, a task-agnostic shield can be constructed that stores mistakes made by any agent, such that once one agent in a group makes a mistake, the entire group learns to never repeat that mistake. This paper introduces a variant of the PPO algorithm that utilizes this shield, called ShieldPPO, and empirically evaluates it in a controlled environment. Results indicate that ShieldPPO outperforms PPO, as well as baseline methods from the safe reinforcement learning literature, in a range of settings.
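The shield described in the abstract can be pictured as a shared database of forbidden (state, action) pairs that masks the policy's action choices at decision time. The following minimal Python sketch illustrates that idea only; the names (`CatastropheShield`, `shielded_step`, `env_step`, `policy_probs`) are hypothetical and are not the paper's actual ShieldPPO implementation, which integrates the shield into PPO's action selection.

```python
import random


class CatastropheShield:
    """Task-agnostic shield: a shared database of (state, action) pairs that
    previously led to catastrophic outcomes. Once any agent records a pair,
    every agent consulting this shield is forbidden from repeating it."""

    def __init__(self):
        self._unsafe = set()  # known catastrophic (state, action) pairs

    def record_catastrophe(self, state, action):
        """Store a mistake so that no agent ever repeats it."""
        self._unsafe.add((state, action))

    def is_allowed(self, state, action):
        return (state, action) not in self._unsafe

    def safe_actions(self, state, actions):
        """Filter a candidate action set down to actions not known to be catastrophic."""
        return [a for a in actions if self.is_allowed(state, a)]


def shielded_step(shield, state, action_space, policy_probs, env_step):
    """Sample an action from the policy restricted to shield-allowed actions,
    then record the (state, action) pair if the environment reports a catastrophe.
    `policy_probs` maps actions to policy probabilities; `env_step` is assumed to
    return (next_state, reward, catastrophic_flag)."""
    allowed = shield.safe_actions(state, action_space)
    if not allowed:                    # degenerate case: every action is blocked
        allowed = list(action_space)   # fall back to the unshielded action set
    weights = [policy_probs[a] for a in allowed]
    action = random.choices(allowed, weights=weights, k=1)[0]
    next_state, reward, catastrophic = env_step(state, action)
    if catastrophic:
        shield.record_catastrophe(state, action)
    return next_state, reward, action
```

Because the shield is keyed only on (state, action) pairs and not on any particular task or reward function, the same `CatastropheShield` instance can be shared across all agents and tasks in the continual learning setting the abstract describes.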