Causal Inference With Selectively Deconfounded Data

Paper Title

Causal Inference With Selectively Deconfounded Data

Authors

Kyra Gan, Andrew A. Li, Zachary C. Lipton, Sridhar Tayur

Abstract

Given only data generated by a standard confounding graph with an unobserved confounder, the Average Treatment Effect (ATE) is not identifiable. To estimate the ATE, a practitioner must then either (a) collect deconfounded data; (b) run a clinical trial; or (c) elucidate further properties of the causal graph that might render the ATE identifiable. In this paper, we consider the benefit of incorporating a large confounded observational dataset (confounder unobserved) alongside a small deconfounded observational dataset (confounder revealed) when estimating the ATE. Our theoretical results suggest that the inclusion of confounded data can significantly reduce the quantity of deconfounded data required to estimate the ATE to within a desired accuracy level. Moreover, in some cases (say, genetics) we could imagine retrospectively selecting samples to deconfound. We demonstrate that by actively selecting these samples based upon the (already observed) treatment and outcome, we can reduce sample complexity further. Our theoretical and empirical results establish that the worst-case relative performance of our approach (vs. a natural benchmark) is bounded while our best-case gains are unbounded. Finally, we demonstrate the benefits of selective deconfounding using a large real-world dataset related to genetic mutation in cancer.
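
To make the idea of combining the two datasets concrete, here is a minimal sketch. It is not necessarily the authors' estimator, and it assumes binary treatment T, outcome Y, and confounder Z. The large confounded dataset supplies P(Y, T), the small deconfounded dataset supplies P(Z = 1 | Y, T), and their product gives the full joint, from which the ATE follows by the standard backdoor adjustment. The function names and the toy distribution are hypothetical and chosen only for illustration.

```python
import numpy as np

def ate_from_joint(p_yt, q_z1_given_yt):
    """Backdoor-adjusted ATE for binary Y, T, Z (illustrative sketch).

    p_yt[y, t]          = P(Y=y, T=t), e.g. estimated from confounded data
    q_z1_given_yt[y, t] = P(Z=1 | Y=y, T=t), e.g. from deconfounded data
    Assumes positivity: every (T, Z) cell has nonzero probability.
    """
    p_yt = np.asarray(p_yt, dtype=float)
    q = np.asarray(q_z1_given_yt, dtype=float)
    # joint[y, t, z] = P(Y=y, T=t) * P(Z=z | Y=y, T=t)
    joint = np.stack([p_yt * (1.0 - q), p_yt * q], axis=-1)
    p_z = joint.sum(axis=(0, 1))        # P(Z=z)
    p_tz = joint.sum(axis=0)            # P(T=t, Z=z), indexed [t, z]
    p_y1_given_tz = joint[1] / p_tz     # P(Y=1 | T=t, Z=z)
    # ATE = sum_z P(z) * [P(Y=1 | T=1, z) - P(Y=1 | T=0, z)]
    return float(np.sum(p_z * (p_y1_given_tz[1] - p_y1_given_tz[0])))

def naive_difference(p_yt):
    """Unadjusted contrast P(Y=1 | T=1) - P(Y=1 | T=0), biased by Z."""
    p_yt = np.asarray(p_yt, dtype=float)
    p_t = p_yt.sum(axis=0)              # P(T=t)
    return float(p_yt[1, 1] / p_t[1] - p_yt[1, 0] / p_t[0])

def toy_joint():
    """Hypothetical ground-truth joint[y, t, z] with a true ATE of 0.3."""
    joint = np.zeros((2, 2, 2))
    for t in (0, 1):
        for z in (0, 1):
            p_z = 0.5
            p_t1 = 0.2 + 0.6 * z                  # Z raises treatment odds
            p_t = p_t1 if t == 1 else 1.0 - p_t1
            p_y1 = 0.1 + 0.3 * t + 0.4 * z        # Z also raises the outcome
            joint[1, t, z] = p_z * p_t * p_y1
            joint[0, t, z] = p_z * p_t * (1.0 - p_y1)
    return joint

if __name__ == "__main__":
    joint = toy_joint()
    p_yt = joint.sum(axis=2)            # what confounded data can reveal
    q = joint[:, :, 1] / p_yt           # what deconfounded data adds
    print("naive:", naive_difference(p_yt))
    print("adjusted:", ate_from_joint(p_yt, q))
```

In this toy setup the unadjusted contrast overstates the effect because Z raises both treatment uptake and the outcome. In practice both inputs would be empirical frequencies, and only the four conditional probabilities in `q_z1_given_yt` require deconfounded samples, which is where the choice of which (Y, T) cells to sample from comes in.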
