Paper Title
Safe Reinforcement Learning From Pixels Using a Stochastic Latent Representation
Paper Authors
Paper Abstract
We address the problem of safe reinforcement learning from pixel observations. Inherent challenges in such settings are (1) a trade-off between reward optimization and adherence to safety constraints, (2) partial observability, and (3) high-dimensional observations. We formalize the problem in a constrained, partially observable Markov decision process framework, where an agent obtains distinct reward and safety signals. To address the curse of dimensionality, we employ a novel safety critic using the stochastic latent actor-critic (SLAC) approach. The latent variable model predicts rewards and safety violations, and we use the safety critic to train safe policies. Using well-known benchmark environments, we demonstrate performance that is competitive with existing approaches with respect to computational requirements, final reward return, and satisfaction of the safety constraints.
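To make the "distinct reward and safety signals" of the constrained setting concrete, the following is a minimal sketch of how one episode might be scored on both signals. It is not the paper's implementation: the function names, the discount factor, and the `cost_budget` threshold are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence, Union


def discounted_return(signal: Sequence[float], gamma: float = 0.99) -> float:
    """Return sum_t gamma**t * signal[t] for one episode."""
    total = 0.0
    # Accumulate from the last step backwards so each step needs one multiply.
    for value in reversed(signal):
        total = value + gamma * total
    return total


def evaluate_episode(
    rewards: Sequence[float],
    costs: Sequence[float],
    cost_budget: float,  # hypothetical safety threshold, not taken from the paper
    gamma: float = 0.99,
) -> Dict[str, Union[float, bool]]:
    """Score an episode on the reward signal and the separate safety (cost) signal."""
    reward_return = discounted_return(rewards, gamma)
    cost_return = discounted_return(costs, gamma)
    return {
        "reward_return": reward_return,
        "cost_return": cost_return,
        # The constraint is satisfied when accumulated cost stays within the budget.
        "is_safe": cost_return <= cost_budget,
    }


if __name__ == "__main__":
    result = evaluate_episode(
        rewards=[1.0, 1.0, 1.0],
        costs=[0.0, 1.0, 0.0],
        cost_budget=1.0,
        gamma=0.5,
    )
    print(result)
```

Under this reading, a policy is judged on two numbers rather than one: a higher reward return is only acceptable while the cost return stays within the budget, which is the trade-off named in challenge (1).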