Off-Policy Evaluation for Episodic Partially Observable Markov Decision Processes under Non-Parametric Models

Paper Title

Off-Policy Evaluation for Episodic Partially Observable Markov Decision Processes under Non-Parametric Models

Authors

Miao, Rui; Qi, Zhengling; Zhang, Xiaoke

Abstract

We study the problem of off-policy evaluation (OPE) for episodic Partially Observable Markov Decision Processes (POMDPs) with continuous states. Motivated by the recently proposed proximal causal inference framework, we develop a non-parametric identification result for estimating the policy value via a sequence of so-called V-bridge functions with the help of time-dependent proxy variables. We then develop a fitted-Q-evaluation-type algorithm to estimate V-bridge functions recursively, where a non-parametric instrumental variable (NPIV) problem is solved at each step. By analyzing this challenging sequential NPIV problem, we establish the finite-sample error bounds for estimating the V-bridge functions and accordingly that for evaluating the policy value, in terms of the sample size, length of horizon and so-called (local) measure of ill-posedness at each step. To the best of our knowledge, this is the first finite-sample error bound for OPE in POMDPs under non-parametric models.
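To make the "NPIV problem solved at each step" concrete, the following is a minimal sketch of a single non-parametric instrumental variable fit using sieve two-stage least squares, one standard NPIV estimator. It is not taken from the paper and may differ from the authors' estimator. The function names `poly_basis` and `sieve_2sls`, the polynomial bases, and the simulated data are all illustrative assumptions.

```python
import numpy as np


def poly_basis(x, degree):
    """Polynomial sieve basis [1, x, ..., x**degree] (an illustrative choice)."""
    return np.vander(np.asarray(x, dtype=float), degree + 1, increasing=True)


def sieve_2sls(y, psi, phi):
    """Sieve two-stage least squares for the restriction E[y - h(X) | Z] = 0.

    psi: basis for the unknown function h evaluated at X, shape (n, J)
    phi: instrument basis evaluated at Z, shape (n, K), with K >= J
    Returns coefficients c such that h(x) is approximated by psi(x) @ c.
    """
    # First stage: project each column of psi onto the span of the instruments.
    psi_hat = phi @ np.linalg.lstsq(phi, psi, rcond=None)[0]
    # Second stage: least squares of y on the projected basis.
    coef = np.linalg.lstsq(psi_hat, y, rcond=None)[0]
    return coef


# Simulated toy data (hypothetical, not from the paper):
# u confounds x and y, while z is independent of u and serves as the instrument.
rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = z + u
y = 2.0 * x + 0.5 * x**2 + u + 0.1 * rng.normal(size=n)

psi = poly_basis(x, 2)  # sieve for h
phi = poly_basis(z, 3)  # sieve for the instrument space

c_2sls = sieve_2sls(y, psi, phi)
c_ols = np.linalg.lstsq(psi, y, rcond=None)[0]  # ignores the confounding

print("sieve 2SLS coefficients:", np.round(c_2sls, 3))
print("naive OLS coefficients: ", np.round(c_ols, 3))
```

In this toy setup the true function is h(x) = 2x + 0.5x^2, so the instrumented fit should land near the coefficients (0, 2, 0.5), while the naive regression is pulled away by the unobserved confounder. In a fitted-Q-evaluation-style recursion, one would presumably call a routine like `sieve_2sls` once per time step, moving backward over the horizon, with the response built from the reward and the next-step estimate. That outline is an assumption about the general shape of such a scheme, not a description of the paper's exact procedure.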
