Model-Free Prediction of Partially Observable Spatiotemporal Chaotic Systems

Paper Title

Safe Reinforcement Learning From Pixels Using a Stochastic Latent Representation

Authors

Hogewind, Yannick; Simao, Thiago D.; Kachman, Tal; Jansen, Nils

Abstract

We address the problem of safe reinforcement learning from pixel observations. Inherent challenges in such settings are (1) a trade-off between reward optimization and adhering to safety constraints, (2) partial observability, and (3) high-dimensional observations. We formalize the problem in a constrained, partially observable Markov decision process framework, where an agent obtains distinct reward and safety signals. To address the curse of dimensionality, we employ a novel safety critic using the stochastic latent actor-critic (SLAC) approach. The latent variable model predicts rewards and safety violations, and we use the safety critic to train safe policies. Using well-known benchmark environments, we demonstrate performance competitive with existing approaches with respect to computational requirements, final reward return, and satisfaction of the safety constraints.
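
For orientation, a constrained decision process with separate reward and safety signals is commonly written as the optimization problem below. This is a generic textbook-style sketch, not the objective stated by the authors: the abstract does not give the formula, and every symbol here is an assumption introduced for illustration (policy \pi, reward r_t, safety cost c_t, discount factor \gamma, horizon T, safety budget d). Under partial observability, \pi would condition on the history of pixel observations, or on a latent state inferred from it, rather than on the true state.

```latex
% Generic constrained objective (illustrative; symbols are assumptions, not taken from the paper)
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{T} \gamma^{t} r_t \right]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{T} \gamma^{t} c_t \right] \le d
```

Read this way, a safety critic would estimate the expected cost term on the left of the constraint, which is what makes it possible to trade reward against safety during policy training.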
