Paper Title

What Makes A Good Fisherman? Linear Regression under Self-Selection Bias

Paper Authors

Yeshwanth Cherapanamjeri, Constantinos Daskalakis, Andrew Ilyas, Manolis Zampetakis

Paper Abstract

In the classical setting of self-selection, the goal is to learn $k$ models, simultaneously from observations $(x^{(i)}, y^{(i)})$ where $y^{(i)}$ is the output of one of $k$ underlying models on input $x^{(i)}$. In contrast to mixture models, where we observe the output of a randomly selected model, here the observed model depends on the outputs themselves, and is determined by some known selection criterion. For example, we might observe the highest output, the smallest output, or the median output of the $k$ models. In known-index self-selection, the identity of the observed model output is observable; in unknown-index self-selection, it is not. Self-selection has a long history in Econometrics and applications in various theoretical and applied fields, including treatment effect estimation, imitation learning, learning from strategically reported data, and learning from markets at disequilibrium. In this work, we present the first computationally and statistically efficient estimation algorithms for the most standard setting of this problem where the models are linear. In the known-index case, we require poly$(1/\varepsilon, k, d)$ sample and time complexity to estimate all model parameters to accuracy $\varepsilon$ in $d$ dimensions, and can accommodate quite general selection criteria. In the more challenging unknown-index case, even the identifiability of the linear models (from infinitely many samples) was not known. We show three results in this case for the commonly studied $\max$ self-selection criterion: (1) we show that the linear models are indeed identifiable, (2) for general $k$ we provide an algorithm with poly$(d) \exp(\text{poly}(k))$ sample and time complexity to estimate the regression parameters up to error $1/\text{poly}(k)$, and (3) for $k = 2$ we provide an algorithm for any error $\varepsilon$ and poly$(d, 1/\varepsilon)$ sample and time complexity.
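To make the observation model concrete, below is a minimal sketch, assuming Gaussian covariates and independent Gaussian noise (illustrative choices, not necessarily the paper's exact setting). It simulates the $\max$ self-selection criterion and numerically checks that a naive per-index least-squares fit on the selected subsample, a hypothetical baseline rather than the paper's estimator, is biased by the selection.

```python
import numpy as np

# Minimal sketch (not the paper's algorithm): simulate the max
# self-selection observation model with k linear models in d dimensions.
rng = np.random.default_rng(0)
k, d, n = 2, 3, 200_000

W = rng.normal(size=(k, d))        # hypothetical true parameters w_1, ..., w_k
X = rng.normal(size=(n, d))        # covariates x^{(i)}
noise = rng.normal(size=(n, k))    # independent noise per model
outputs = X @ W.T + noise          # all k latent model outputs per sample

y = outputs.max(axis=1)            # observed output under the max criterion
idx = outputs.argmax(axis=1)       # index, observable only in the known-index case

# Naive baseline for model 0 in the known-index case: ordinary least
# squares restricted to the samples where model 0 was selected. The
# selection event correlates model 0's noise with x on this subsample,
# so the estimate is biased away from W[0].
sel = idx == 0
w_hat, *_ = np.linalg.lstsq(X[sel], y[sel], rcond=None)
print("true w_0:                      ", np.round(W[0], 3))
print("naive OLS on selected samples: ", np.round(w_hat, 3))
```

The bias arises because conditioning on model $j$ attaining the maximum ties its noise to the covariates within the selected subsample; this is what makes the estimation problem nontrivial even in the known-index case.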
