Paper Title

How to Enable Uncertainty Estimation in Proximal Policy Optimization

Paper Authors

Eugene Bykovets, Yannick Metz, Mennatallah El-Assady, Daniel A. Keim, Joachim M. Buhmann

Abstract

While deep reinforcement learning (RL) agents have showcased strong results across many domains, a major concern is their inherent opaqueness and the safety of such systems in real-world use cases. To overcome these issues, we need agents that can quantify their uncertainty and detect out-of-distribution (OOD) states. Existing uncertainty estimation techniques, like Monte-Carlo Dropout or Deep Ensembles, have not seen widespread adoption in on-policy deep RL. We posit that this is due to two reasons: first, concepts like uncertainty and OOD states are not well defined compared to supervised learning, especially for on-policy RL methods. Second, available implementations and comparative studies for uncertainty estimation methods in RL have been limited. To overcome the first gap, we propose definitions of uncertainty and OOD for Actor-Critic RL algorithms, namely proximal policy optimization (PPO), and present possible applicable measures. In particular, we discuss the concepts of value and policy uncertainty. The second point is addressed by implementing different uncertainty estimation methods and comparing them across a number of environments. The OOD detection performance is evaluated via a custom evaluation benchmark of in-distribution (ID) and OOD states for various RL environments. We identify a trade-off between reward and OOD detection performance. To overcome this, we formulate a Pareto optimization problem in which we simultaneously optimize for reward and OOD detection performance. We show experimentally that the recently proposed method of Masksembles strikes a favourable balance among the surveyed methods, enabling high-quality uncertainty estimation and OOD detection while matching the performance of original RL agents.
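For readers who want a concrete picture of the value-uncertainty idea mentioned in the abstract, below is a minimal PyTorch sketch of one of the baselines it names, Monte-Carlo Dropout applied to a PPO-style value head: uncertainty is read off as the spread of value estimates over stochastic forward passes. The `ValueHead` and `value_uncertainty` names are hypothetical illustrations, not the authors' implementation; the paper's actual measures and the Masksembles setup differ in their details.

```python
# Illustrative sketch only: estimating value uncertainty for a PPO-style critic
# with Monte-Carlo Dropout. All names here are hypothetical, not the paper's code.
import torch
import torch.nn as nn


class ValueHead(nn.Module):
    """A small critic network whose dropout layers are kept active at inference."""

    def __init__(self, obs_dim: int, hidden: int = 64, p_drop: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)


@torch.no_grad()
def value_uncertainty(critic: ValueHead, obs: torch.Tensor, n_samples: int = 30):
    """Return the mean value estimate and its std. dev. over stochastic passes.

    A large standard deviation can serve as a simple OOD score: states far from
    the training distribution tend to produce more disagreement between passes.
    """
    critic.train()  # keep dropout active ("MC Dropout")
    samples = torch.stack([critic(obs) for _ in range(n_samples)], dim=0)
    return samples.mean(dim=0), samples.std(dim=0)


if __name__ == "__main__":
    critic = ValueHead(obs_dim=8)
    states = torch.randn(4, 8)  # batch of 4 observations
    v_mean, v_std = value_uncertainty(critic, states)
    print("value estimates:", v_mean)
    print("uncertainty (std):", v_std)
```

Policy uncertainty can be treated analogously, e.g. by measuring disagreement (entropy or KL divergence) between the action distributions produced by the stochastic passes; thresholding either score yields a simple OOD detector of the kind evaluated in the paper's benchmark.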
