Title
Verifiable RNN-Based Policies for POMDPs Under Temporal Logic Constraints
Authors
Abstract
Recurrent neural networks (RNNs) have emerged as an effective representation of control policies in sequential decision-making problems. However, a major drawback in the application of RNN-based policies is the difficulty of providing formal guarantees on the satisfaction of behavioral specifications, e.g., safety and/or reachability. By integrating techniques from formal methods and machine learning, we propose an approach to automatically extract a finite-state controller (FSC) from an RNN, which, when composed with a finite-state system model, is amenable to existing formal verification tools. Specifically, we introduce an iterative modification to the so-called quantized bottleneck insertion technique to create an FSC as a randomized policy with memory. For the cases in which the resulting FSC fails to satisfy the specification, verification generates diagnostic information. We utilize this information either to adjust the amount of memory in the extracted FSC or to perform focused retraining of the RNN. While generally applicable, we detail the resulting iterative procedure in the context of policy synthesis for partially observable Markov decision processes (POMDPs), a problem that is notoriously hard. Numerical experiments show that the proposed approach outperforms traditional POMDP synthesis methods by three orders of magnitude while remaining within 2% of optimal benchmark values.
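The abstract describes an iterative loop: extract an FSC from the RNN via quantized bottleneck insertion, formally verify it against the specification, and on failure either enlarge the FSC's memory or retrain the RNN using the diagnostic information. A minimal, hypothetical sketch of that control flow is below; all function bodies are toy stand-ins (the actual method uses quantized bottleneck insertion and a formal model checker), and every name here is illustrative rather than taken from the paper.

```python
def extract_fsc(rnn, memory_size):
    # Toy stand-in for quantized bottleneck insertion: discretize the RNN's
    # continuous hidden state into `memory_size` FSC memory nodes.
    return {"memory": memory_size, "policy": rnn}

def verify(fsc, spec_threshold=8):
    # Toy stand-in for model-checking the FSC composed with the POMDP.
    # Here the spec is "satisfied" once the FSC has enough memory nodes.
    ok = fsc["memory"] >= spec_threshold
    diagnostics = None if ok else "insufficient memory"
    return ok, diagnostics

def synthesize(rnn, memory=2, max_memory=16, max_iters=10):
    """Iterate: extract an FSC, verify it, and refine on failure."""
    for _ in range(max_iters):
        fsc = extract_fsc(rnn, memory)
        ok, diag = verify(fsc)
        if ok:
            return fsc          # FSC satisfies the specification
        if memory < max_memory:
            memory *= 2         # adjust the amount of memory in the FSC
        else:
            rnn = rnn + "*"     # placeholder for focused retraining
    return None                 # no verified FSC found within the budget

fsc = synthesize("rnn_policy")
print(fsc["memory"])  # memory doubles 2 -> 4 -> 8, then verification passes: 8
```

With the toy verifier above, the loop doubles the memory size until the check passes; in the paper's setting this branch point is where the diagnostic information decides between adding memory and retraining.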