Paper Title

Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies

Paper Authors

Nathan Kallus, Masatoshi Uehara

Paper Abstract

Offline reinforcement learning, wherein one uses off-policy data logged by a fixed behavior policy to evaluate and learn new policies, is crucial in applications where experimentation is limited, such as medicine. We study the estimation of the policy value and gradient of a deterministic policy from off-policy data when actions are continuous. Targeting deterministic policies, for which the action is a deterministic function of the state, is crucial since optimal policies are always deterministic (up to ties). In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist. To circumvent this issue, we propose several new doubly robust estimators based on different kernelization approaches. We analyze the asymptotic mean-squared error of each of these under mild rate conditions for nuisance estimators. Specifically, we demonstrate how to obtain a rate that is independent of the horizon length.
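
To make the kernelization idea concrete, below is a minimal sketch of the single-step (contextual-bandit) special case: a deterministic policy places a point mass at pi(s), so the usual density ratio is undefined, and a kernel K_h centered at pi(s) is used in its place inside the doubly robust correction term. The function and argument names (kernel_dr_value, q_hat, b_hat) and the Gaussian kernel choice are illustrative assumptions, not the paper's exact construction; the paper's estimators further handle the sequential (horizon-dependent) setting and policy gradients.

```python
import numpy as np

def kernel_dr_value(s, a, r, pi, q_hat, b_hat, h):
    """Sketch of a kernelized doubly robust value estimate for a
    deterministic policy pi with continuous actions (single step).

    s, a, r : arrays of logged states, actions, rewards (n samples)
    pi      : deterministic target policy, pi(s) -> action
    q_hat   : estimated outcome model, q_hat(s, a) -> expected reward
    b_hat   : estimated behavior density, b_hat(a, s) -> pi_b(a | s)
    h       : kernel bandwidth
    """
    target_a = pi(s)
    # Gaussian kernel K_h(u) = K(u / h) / h smooths the point mass at
    # pi(s), standing in for the density ratio, which does not exist.
    k = np.exp(-0.5 * ((a - target_a) / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
    # Direct-method term plus a kernel-weighted correction of the
    # outcome-model residual (the doubly robust structure).
    dm = q_hat(s, target_a)
    correction = (k / b_hat(a, s)) * (r - q_hat(s, a))
    return float(np.mean(dm + correction))

if __name__ == "__main__":
    # Hypothetical synthetic check: behavior policy N(s, 1), reward
    # peaking at a = s, so the target policy pi(s) = s has true value 0.
    rng = np.random.default_rng(0)
    n = 5000
    s = rng.normal(size=n)
    a = s + rng.normal(size=n)
    r = -(a - s) ** 2 + rng.normal(size=n)
    pi = lambda s: s
    q = lambda s, a: -(a - s) ** 2
    b = lambda a, s: np.exp(-0.5 * (a - s) ** 2) / np.sqrt(2.0 * np.pi)
    print(kernel_dr_value(s, a, r, pi, q, b, h=0.3))  # approx. 0
```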
