Paper Title

ReVar: Strengthening Policy Evaluation via Reduced Variance Sampling

Paper Authors

Mukherjee, Subhojyoti; Hanna, Josiah P.; Nowak, Robert

Paper Abstract

This paper studies the problem of data collection for policy evaluation in Markov decision processes (MDPs). In policy evaluation, we are given a target policy and asked to estimate the expected cumulative reward it will obtain in an environment formalized as an MDP. We develop theory for optimal data collection within the class of tree-structured MDPs by first deriving an oracle data collection strategy that uses knowledge of the variance of the reward distributions. We then introduce the Reduced Variance Sampling (ReVar) algorithm that approximates the oracle strategy when the reward variances are unknown a priori and bound its sub-optimality compared to the oracle strategy. Finally, we empirically validate that ReVar leads to policy evaluation with mean squared error comparable to the oracle strategy and significantly lower than simply running the target policy.
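As a rough illustration of the underlying idea (not the paper's algorithm or its oracle formula), the sketch below compares two data-collection allocations in a one-step bandit with hypothetical Gaussian rewards: sampling in proportion to the target policy versus a Neyman-style allocation proportional to π(a)·σ(a), the classical form of variance-aware allocation the abstract alludes to. The reward means, standard deviations, target policy, and plug-in estimator are all assumptions made for illustration; the paper itself works with tree-structured MDPs and derives its own oracle strategy and the ReVar approximation of it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-step setting (a bandit): per-action reward means/stds and a target policy.
mu    = np.array([1.0, 0.5, 2.0])   # true mean reward per action (assumed)
sigma = np.array([0.1, 2.0, 0.5])   # true reward std per action (assumed)
pi    = np.array([0.5, 0.3, 0.2])   # target policy over actions (assumed)
true_value = float(pi @ mu)          # v(pi) = sum_a pi(a) * mu(a)

def plug_in_estimate(counts, rewards_sum):
    """Plug-in estimate of v(pi) built from per-action sample means."""
    means = rewards_sum / np.maximum(counts, 1)
    return float(pi @ means)

def collect(allocation, n):
    """Draw ~n reward samples, spread over actions according to `allocation`."""
    counts = np.maximum((allocation * n).astype(int), 1)
    rewards_sum = np.array([
        rng.normal(mu[a], sigma[a], size=counts[a]).sum() for a in range(len(mu))
    ])
    return counts, rewards_sum

n, trials = 300, 2000
# "On-policy" baseline simplified to a fixed allocation proportional to pi(a).
on_policy_alloc = pi
# Variance-aware (Neyman-style) allocation proportional to pi(a) * sigma(a).
oracle_alloc = pi * sigma / np.sum(pi * sigma)

mse = {"on-policy": 0.0, "variance-aware": 0.0}
for _ in range(trials):
    for name, alloc in [("on-policy", on_policy_alloc), ("variance-aware", oracle_alloc)]:
        counts, rsum = collect(alloc, n)
        mse[name] += (plug_in_estimate(counts, rsum) - true_value) ** 2 / trials

print({k: round(v, 6) for k, v in mse.items()})
```

Running this sketch, the variance-aware allocation typically yields a noticeably lower mean squared error than the on-policy allocation at the same sample budget, which mirrors the qualitative claim in the abstract.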
