Paper title
Fast and Powerful Conditional Randomization Testing via Distillation
Paper authors
Abstract
We consider the problem of conditional independence testing: given a response Y and covariates (X,Z), we test the null hypothesis that Y is independent of X given Z. The conditional randomization test (CRT) was recently proposed as a way to use distributional information about X|Z to exactly (non-asymptotically) control Type-I error using any test statistic in any dimensionality without assuming anything about Y|(X,Z). This flexibility in principle allows one to derive powerful test statistics from complex prediction algorithms while maintaining statistical validity. Yet the direct use of such advanced test statistics in the CRT is prohibitively computationally expensive, especially with multiple testing, due to the CRT's requirement to recompute the test statistic many times on resampled data. We propose the distilled CRT, a novel approach to using state-of-the-art machine learning algorithms in the CRT while drastically reducing the number of times those algorithms need to be run, thereby taking advantage of their power and the CRT's statistical guarantees without suffering the usual computational expense. In addition to distillation, we propose a number of other tricks like screening and recycling computations to further speed up the CRT without sacrificing its high power and exact validity. Indeed, we show in simulations that all our proposals combined lead to a test that has similar power to the most powerful existing CRT implementations but requires orders of magnitude less computation, making it a practical tool even for large data sets. We demonstrate these benefits on a breast cancer dataset by identifying biomarkers related to cancer stage.
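To make the computational contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a known Gaussian model for X|Z, uses ordinary least squares as a stand-in for a "complex prediction algorithm", and takes one plausible reading of distillation: fit Y on Z once, then resample only a cheap statistic of the residuals. The function names (`crt_pvalue`, `distilled_crt_pvalue`, `refit_statistic`) and all simulation settings are hypothetical and do not come from the paper.

```python
import numpy as np

def crt_pvalue(y, x, z, sample_x_given_z, statistic, n_resamples=200, rng=None):
    """Plain CRT: recompute the full statistic on every resampled copy of X."""
    rng = np.random.default_rng(rng)
    t_obs = statistic(y, x, z)
    t_null = np.array([statistic(y, sample_x_given_z(z, rng), z)
                       for _ in range(n_resamples)])
    # Finite-sample valid p-value: the observed statistic counts as one draw.
    return (1 + np.sum(t_null >= t_obs)) / (n_resamples + 1)

def refit_statistic(y, x, z):
    """Refits a model of Y on (X, Z) on every call (the costly part of the CRT)."""
    design = np.column_stack([np.ones(len(y)), x, z])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return abs(coef[1])  # magnitude of the coefficient on X

def distilled_crt_pvalue(y, x, z, cond_mean, sample_x_given_z,
                         n_resamples=200, rng=None):
    """Distilled variant (assumed form): fit Y on Z once, then resample cheaply."""
    rng = np.random.default_rng(rng)
    # The only model fit: distill the information in Z about Y. It never sees X.
    design = np.column_stack([np.ones(len(y)), z])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    y_resid = y - design @ coef
    mu = cond_mean(z)  # E[X | Z], assumed known

    def stat(x_):
        # Cheap statistic: squared inner product of the two residual vectors.
        return float(y_resid @ (x_ - mu)) ** 2

    t_obs = stat(x)
    t_null = np.array([stat(sample_x_given_z(z, rng)) for _ in range(n_resamples)])
    return (1 + np.sum(t_null >= t_obs)) / (n_resamples + 1)

# Toy simulation (all settings are illustrative assumptions).
rng = np.random.default_rng(0)
n, p = 300, 5
z = rng.normal(size=(n, p))
beta = np.full(p, 0.5)

def cond_mean(z_):
    return z_ @ beta

def sample_x(z_, gen):
    # Assumed known law: X | Z ~ N(Z @ beta, 1)
    return cond_mean(z_) + gen.normal(size=len(z_))

x = sample_x(z, rng)
y_null = z[:, 0] + rng.normal(size=n)   # Y independent of X given Z
y_alt = y_null + x                      # Y depends on X given Z

p_null_dcrt = distilled_crt_pvalue(y_null, x, z, cond_mean, sample_x, rng=1)
p_alt_dcrt = distilled_crt_pvalue(y_alt, x, z, cond_mean, sample_x, rng=1)
p_alt_crt = crt_pvalue(y_alt, x, z, sample_x, refit_statistic, rng=1)
print(p_null_dcrt, p_alt_dcrt, p_alt_crt)
```

In this sketch the plain CRT performs `n_resamples + 1` model fits per tested variable, while the distilled variant performs one fit plus `n_resamples + 1` inner products. Both remain valid under the null because the fitted quantities depend only on (Y, Z), so the resampled copies of X are exchangeable with the observed X.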