Paper Title

Enforcing Almost-Sure Reachability in POMDPs

Paper Authors

Sebastian Junges, Nils Jansen, Sanjit A. Seshia

Paper Abstract

Partially-Observable Markov Decision Processes (POMDPs) are a well-known stochastic model for sequential decision making under limited information. We consider the EXPTIME-hard problem of synthesising policies that almost-surely reach some goal state without ever visiting a bad state. In particular, we are interested in computing the winning region, that is, the set of system configurations from which a policy exists that satisfies the reachability specification. A direct application of such a winning region is the safe exploration of POMDPs by, for instance, restricting the behavior of a reinforcement learning agent to the region. We present two algorithms: A novel SAT-based iterative approach and a decision-diagram based alternative. The empirical evaluation demonstrates the feasibility and efficacy of the approaches.
