Paper Title
Statistical Inference for Online Decision-Making: In a Contextual Bandit Setting
Paper Authors
Paper Abstract
The online decision-making problem requires us to make a sequence of decisions based on incremental information. Given the contextual information, common solutions typically learn a reward model for the different actions and then maximize the long-term reward. It is meaningful to know whether the posited model is reasonable and how the model performs in the asymptotic sense. We study this problem under the contextual bandit framework with a linear reward model. The $\varepsilon$-greedy policy is adopted to address the classic exploration-exploitation dilemma. Using the martingale central limit theorem, we show that the online ordinary least squares estimator of the model parameters is asymptotically normal. When the linear model is misspecified, we propose an online weighted least squares estimator using inverse propensity score weighting and also establish its asymptotic normality. Based on the properties of the parameter estimators, we further show that the in-sample inverse propensity weighted value estimator is asymptotically normal. We illustrate our results using simulations and an application to a news article recommendation dataset from Yahoo!.
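To make the setup concrete, below is a minimal simulation sketch of the components the abstract describes: an $\varepsilon$-greedy policy over a linear reward model, per-arm online ordinary least squares, an inverse-propensity-weighted least squares update, and an in-sample IPW value estimate of the greedy policy. The environment, function name run_bandit, and all parameter values are illustrative assumptions, not the paper's implementation or data.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_bandit(T=5000, d=5, K=3, eps=0.1):
    """Illustrative epsilon-greedy contextual bandit with a linear reward model.

    Rewards follow x @ beta_k + Gaussian noise (a synthetic setup).  Returns
    per-arm online OLS estimates, IPW-weighted least squares estimates, and
    the in-sample IPW value estimate of the greedy policy.
    """
    beta_true = rng.normal(size=(K, d))          # unknown arm parameters
    XtX = [np.eye(d) * 1e-6 for _ in range(K)]   # per-arm Gram matrices (small ridge jitter)
    Xty = [np.zeros(d) for _ in range(K)]
    XtX_w = [np.eye(d) * 1e-6 for _ in range(K)] # accumulators for the IPW-weighted estimator
    Xty_w = [np.zeros(d) for _ in range(K)]
    ipw_value = 0.0

    for t in range(T):
        x = rng.normal(size=d)                   # observed context
        beta_hat = [np.linalg.solve(XtX[k], Xty[k]) for k in range(K)]
        greedy = int(np.argmax([x @ b for b in beta_hat]))
        a = int(rng.integers(K)) if rng.random() < eps else greedy
        # propensity of the chosen arm under the epsilon-greedy policy
        prop = (1 - eps) + eps / K if a == greedy else eps / K
        r = x @ beta_true[a] + rng.normal()      # observed reward

        # online OLS update for the chosen arm
        XtX[a] += np.outer(x, x)
        Xty[a] += r * x
        # IPW-weighted update (weights 1/prop), used when the model may be misspecified
        XtX_w[a] += np.outer(x, x) / prop
        Xty_w[a] += (r / prop) * x
        # in-sample IPW value estimator of the greedy policy
        if a == greedy:
            ipw_value += r / (prop * T)

    ols = [np.linalg.solve(XtX[k], Xty[k]) for k in range(K)]
    wls = [np.linalg.solve(XtX_w[k], Xty_w[k]) for k in range(K)]
    return ols, wls, ipw_value

ols, wls, value = run_bandit()
print("IPW value estimate of the greedy policy:", round(value, 3))
```

In this sketch the propensity of the action actually taken is known exactly from the $\varepsilon$-greedy rule, which is what makes the inverse propensity weighting and the resulting value estimate feasible without modeling the behavior policy.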