Paper Title
Efficient Evaluation of Natural Stochastic Policies in Offline Reinforcement Learning
Paper Authors
Paper Abstract
We study the efficient off-policy evaluation of natural stochastic policies, which are defined in terms of deviations from the behavior policy. This is a departure from the literature on off-policy evaluation, where most work considers the evaluation of explicitly specified policies. Crucially, offline reinforcement learning with natural stochastic policies can help alleviate issues of weak overlap, lead to policies that build upon current practice, and improve the implementability of policies in practice. Compared with the classic case of a pre-specified evaluation policy, when evaluating natural stochastic policies the efficiency bound, which measures the best-achievable estimation error, is inflated because the evaluation policy itself is unknown. In this paper, we derive the efficiency bounds for two major types of natural stochastic policies: tilting policies and modified treatment policies. We then propose efficient nonparametric estimators that attain these efficiency bounds under very lax conditions. These estimators also enjoy a (partial) double robustness property.
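For intuition, a common way such behavior-policy deviations are formalized in the off-policy evaluation literature is sketched below; these are illustrative forms only, and the paper's exact definitions may differ. A tilting policy exponentially reweights the (unknown) behavior policy \pi^{b} by a user-chosen tilt \tau, while a modified treatment policy applies a known map q to the action the behavior policy would have taken:

\pi^{e}(a \mid s) \;=\; \frac{\pi^{b}(a \mid s)\, e^{\tau(a)}}{\sum_{a'} \pi^{b}(a' \mid s)\, e^{\tau(a')}} \qquad \text{(tilting policy)},

A^{e} \;=\; q(S, A), \quad A \sim \pi^{b}(\cdot \mid S), \quad \text{e.g. } q(s, a) = a + \delta \qquad \text{(modified treatment policy)}.

In both sketches the evaluation policy is defined through the unknown \pi^{b} and is therefore itself unknown, which is precisely why the efficiency bound is inflated relative to the classic case of a pre-specified evaluation policy.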