Paper Title
Off-Policy Evaluation for Episodic Partially Observable Markov Decision Processes under Non-Parametric Models
Paper Authors
Paper Abstract
We study the problem of off-policy evaluation (OPE) for episodic Partially Observable Markov Decision Processes (POMDPs) with continuous states. Motivated by the recently proposed proximal causal inference framework, we develop a non-parametric identification result for estimating the policy value via a sequence of so-called V-bridge functions with the help of time-dependent proxy variables. We then develop a fitted-Q-evaluation-type algorithm to estimate the V-bridge functions recursively, where a non-parametric instrumental variable (NPIV) problem is solved at each step. By analyzing this challenging sequential NPIV problem, we establish finite-sample error bounds for estimating the V-bridge functions and, accordingly, for evaluating the policy value, in terms of the sample size, the horizon length, and the so-called (local) measure of ill-posedness at each step. To the best of our knowledge, this is the first finite-sample error bound for OPE in POMDPs under non-parametric models.
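For readers unfamiliar with NPIV, the following is a minimal sketch of a single generic NPIV step, that is, estimating a function h from the conditional moment restriction E[y - h(x) | z] = 0. It uses sieve two-stage least squares with polynomial bases, which is a standard textbook estimator and is not claimed to be the estimator used in the paper. The function names (`poly_basis`, `sieve_npiv`, `simulate_confounded_data`), the simulated data, and the basis degrees are all assumptions made for illustration; the paper's recursion over time steps and its proxy variables are not reproduced here.

```python
import numpy as np


def poly_basis(v, degree):
    """Polynomial sieve basis [1, v, ..., v^degree] for a 1-D array v."""
    return np.vander(np.asarray(v, dtype=float), degree + 1, increasing=True)


def sieve_npiv(x, z, y, deg_x=2, deg_z=3):
    """Sieve 2SLS estimate of h in E[y - h(x) | z] = 0.

    Returns the coefficients of h in the polynomial basis of degree deg_x.
    Illustrative only: the bases and degrees are assumptions.
    """
    psi = poly_basis(x, deg_x)  # basis for the unknown function h
    phi = poly_basis(z, deg_z)  # basis spanning the instrument space
    # First stage: project each column of psi onto the instrument basis.
    gamma, *_ = np.linalg.lstsq(phi, psi, rcond=None)
    psi_hat = phi @ gamma
    # Second stage: regress the response on the projected basis.
    beta, *_ = np.linalg.lstsq(psi_hat, y, rcond=None)
    return beta


def simulate_confounded_data(n=20000, seed=0):
    """Hypothetical data: u confounds x and y, z is a valid instrument."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    u = rng.normal(size=n)            # unobserved confounder
    x = z + 0.5 * u                   # regressor depends on z and u
    y = x + 0.5 * x**2 + u + 0.1 * rng.normal(size=n)  # true h(x) = x + 0.5 x^2
    return x, z, y
```

In this simulated setting, an ordinary least-squares fit of y on the basis of x would be biased by the confounder u, whereas the two-stage fit uses only the variation in x explained by z. In a sequential scheme such as the one the abstract describes, a step like this would be solved once per time step, with estimation error from later steps entering the response of earlier ones.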