Paper Title

Local Metric Learning for Off-Policy Evaluation in Contextual Bandits with Continuous Actions

Paper Authors

Haanvid Lee, Jongmin Lee, Yunseon Choi, Wonseok Jeon, Byung-Jun Lee, Yung-Kyun Noh, Kee-Eung Kim

Paper Abstract

We consider local kernel metric learning for off-policy evaluation (OPE) of deterministic policies in contextual bandits with continuous action spaces. Our work is motivated by practical scenarios where the target policy needs to be deterministic due to domain requirements, such as prescription of treatment dosage and duration in medicine. Although importance sampling (IS) provides a basic principle for OPE, it is ill-posed for the deterministic target policy with continuous actions. Our main idea is to relax the target policy and pose the problem as kernel-based estimation, where we learn the kernel metric in order to minimize the overall mean squared error (MSE). We present an analytic solution for the optimal metric, based on the analysis of bias and variance. Whereas prior work has been limited to scalar action spaces or kernel bandwidth selection, our work takes a step further, being capable of handling vector action spaces and metric optimization. We show that our estimator is consistent, and significantly reduces the MSE compared to baseline OPE methods through experiments on various domains.
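To make the kernel-relaxation idea from the abstract concrete, below is a minimal sketch of an importance-weighted kernel estimator for a deterministic target policy over a continuous (vector) action space, using a Gaussian kernel with a Mahalanobis metric. The function name `kernel_ope_estimate`, the Gaussian kernel choice, and all variable names are illustrative assumptions; this is not the paper's actual algorithm, which additionally derives an analytic, MSE-minimizing solution for the metric.

```python
import numpy as np

def kernel_ope_estimate(contexts, actions, rewards, behavior_density,
                        target_policy, metric, bandwidth=1.0):
    """Kernel-relaxed OPE estimate of a deterministic target policy's value.

    contexts:          (n, d_x) logged contexts
    actions:           (n, d_a) logged continuous actions
    rewards:           (n,) logged rewards
    behavior_density:  (n,) behavior policy densities mu(a_i | x_i)
    target_policy:     maps a context to a deterministic action of shape (d_a,)
    metric:            (d_a, d_a) positive-definite matrix A defining the kernel metric
    bandwidth:         scalar kernel bandwidth h
    """
    n, d_a = actions.shape
    # Deterministic target actions pi(x_i) for each logged context.
    target_actions = np.stack([target_policy(x) for x in contexts])
    diffs = actions - target_actions                              # (n, d_a)
    # Squared Mahalanobis distance (a_i - pi(x_i))^T A (a_i - pi(x_i)).
    sq_dist = np.einsum('ni,ij,nj->n', diffs, metric, diffs)
    # Gaussian kernel with metric A and bandwidth h (relaxes the Dirac delta
    # that an exact deterministic-policy importance weight would require).
    norm_const = np.sqrt(np.linalg.det(metric)) / ((2 * np.pi) ** (d_a / 2) * bandwidth ** d_a)
    kernel_weights = norm_const * np.exp(-sq_dist / (2 * bandwidth ** 2))
    # Importance-weighted kernel estimate of the target policy value.
    return np.mean(kernel_weights * rewards / behavior_density)


# Tiny synthetic usage example (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
n, d_a = 1000, 2
X = rng.normal(size=(n, 3))
A_logged = rng.normal(size=(n, d_a))                  # behavior actions ~ N(0, I)
mu = np.exp(-0.5 * np.sum(A_logged ** 2, axis=1)) / (2 * np.pi)
R = -np.sum((A_logged - X[:, :d_a]) ** 2, axis=1)     # reward peaks at a = x[:2]
pi = lambda x: x[:d_a]                                # deterministic target policy
V_hat = kernel_ope_estimate(X, A_logged, R, mu, pi, metric=np.eye(d_a), bandwidth=0.5)
```

A sharper (better-conditioned) metric reduces bias but inflates variance, which is the trade-off the paper's analytic metric solution balances when minimizing the overall MSE.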
