Paper Title
Constrained Policy Optimization via Bayesian World Models
Authors
Abstract
Improving sample efficiency and safety are crucial challenges when deploying reinforcement learning in high-stakes real-world applications. We propose LAMBDA, a novel model-based approach for policy optimization in safety-critical tasks modeled via constrained Markov decision processes. Our approach uses Bayesian world models and harnesses the resulting uncertainty to maximize optimistic upper bounds on the task objective, as well as pessimistic upper bounds on the safety constraints. We demonstrate LAMBDA's state-of-the-art performance on the Safety-Gym benchmark suite in terms of sample efficiency and constraint violation.
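The idea of optimistic and pessimistic upper bounds can be illustrated with a minimal sketch. It assumes that several world models have been sampled from the Bayesian posterior and that each one yields an estimated return and an estimated cost for the same policy. The function names, the sample values, and the cost budget below are hypothetical and are not taken from the paper; the sketch only shows how an upper bound over posterior samples is optimistic for the objective and pessimistic for the constraint.

```python
from typing import Sequence, Tuple


def bounds_from_posterior_samples(
    sampled_returns: Sequence[float],
    sampled_costs: Sequence[float],
) -> Tuple[float, float]:
    """Return (optimistic return bound, pessimistic cost bound).

    Each entry is the estimate produced by one world model drawn from
    the posterior (hypothetical inputs for illustration).
    """
    if not sampled_returns or not sampled_costs:
        raise ValueError("need at least one posterior sample")
    # The largest plausible return is an optimistic view of the objective.
    optimistic_return = max(sampled_returns)
    # The largest plausible cost is a pessimistic view of the constraint.
    pessimistic_cost = max(sampled_costs)
    return optimistic_return, pessimistic_cost


def is_pessimistically_safe(
    sampled_costs: Sequence[float], cost_budget: float
) -> bool:
    """True if even the worst sampled cost stays within the budget."""
    return max(sampled_costs) <= cost_budget


# Example with three hypothetical posterior samples.
returns = [10.0, 12.5, 11.0]
costs = [20.0, 27.0, 23.0]
print(bounds_from_posterior_samples(returns, costs))
print(is_pessimistically_safe(costs, cost_budget=25.0))
```

In this toy example one sampled model predicts a cost above the budget, so the policy is treated as unsafe even though the other samples satisfy the constraint.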