Learning from a Biased Sample

Paper Title

Learning from a Biased Sample

Authors

Roshni Sahoo, Lihua Lei, Stefan Wager

Abstract

The empirical risk minimization approach to data-driven decision making requires access to training data drawn under the same conditions as those that will be faced when the decision rule is deployed. However, in a number of settings, we may be concerned that our training sample is biased in the sense that some groups (characterized by either observable or unobservable attributes) may be under- or over-represented relative to the general population. In this setting, empirical risk minimization over the training set may fail to yield rules that perform well at deployment. We propose a model of sampling bias called conditional $\Gamma$-biased sampling, where observed covariates can affect the probability of sample selection arbitrarily much but the amount of unexplained variation in the probability of sample selection is bounded by a constant factor. Applying the distributionally robust optimization framework, we propose a method for learning a decision rule that minimizes the worst-case risk incurred under a family of test distributions that can generate the training distribution under $\Gamma$-biased sampling. We apply a result of Rockafellar and Uryasev to show that this problem is equivalent to an augmented convex risk minimization problem. We give statistical guarantees for learning a model that is robust to sampling bias via the method of sieves, and propose a deep learning algorithm whose loss function captures our robust learning target. We empirically validate our proposed method in a case study on prediction of mental health scores from health survey data and a case study on ICU length of stay prediction.
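The abstract does not spell out the augmented objective, so the following is only a rough illustration of the kind of result it cites, not the paper's method. It sketches the well-known Rockafellar-Uryasev variational formula for conditional value-at-risk, $\mathrm{CVaR}_\alpha(L) = \min_\eta \{\eta + \mathbb{E}[(L - \eta)_+]/(1 - \alpha)\}$, which turns a worst-case (tail) risk into a convex minimization over one extra variable. The function names, the toy losses, and the level `alpha` are assumptions made for this example.

```python
def ru_objective(losses, eta, alpha):
    """Rockafellar-Uryasev objective: eta + mean((loss - eta)_+) / (1 - alpha)."""
    excess = [max(loss - eta, 0.0) for loss in losses]
    return eta + sum(excess) / len(losses) / (1.0 - alpha)


def cvar_via_ru(losses, alpha):
    """Minimize the RU objective over the auxiliary variable eta.

    The objective is convex and piecewise linear in eta, with kinks at the
    sample losses, so a minimizer can be found among the sample values.
    """
    return min(ru_objective(losses, eta, alpha) for eta in losses)


# Toy losses (hypothetical data): 1.0, 2.0, ..., 10.0
losses = [float(i) for i in range(1, 11)]
cvar = cvar_via_ru(losses, alpha=0.8)

# For comparison: the average of the worst 20% of losses (the top two values).
tail_mean = sum(sorted(losses)[-2:]) / 2
print(cvar, tail_mean)
```

In this toy case the minimized objective agrees with the mean of the worst 20% of losses, which is the sense in which a worst-case risk can be rewritten as an ordinary convex minimization with an added auxiliary variable.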
