Paper title

When to Impute? Imputation before and during cross-validation

Authors

Jaeger, Byron C., Tierney, Nicholas J., Simon, Noah R.

Abstract

Cross-validation (CV) is a technique used to estimate generalization error for prediction models. For pipeline modeling algorithms (i.e. modeling procedures with multiple steps), it has been recommended that the entire sequence of steps be carried out during each replicate of CV to mimic the application of the entire pipeline to an external testing set. While theoretically sound, following this recommendation can lead to high computational costs when a pipeline modeling algorithm includes computationally expensive operations, e.g. imputation of missing values. There is a general belief that unsupervised variable selection (i.e. ignoring the outcome) can be applied before conducting CV without incurring bias, but there is less consensus for unsupervised imputation of missing values. We empirically assessed whether conducting unsupervised imputation prior to CV would result in biased estimates of generalization error or result in poorly selected tuning parameters and thus degrade the external performance of downstream models. Results show that, despite optimistic bias, the reduced variance of imputation before CV compared to imputation during each replicate of CV leads to a lower overall root mean squared error for estimation of the true external R-squared, and that the performance of models tuned using CV with imputation before versus during each replication is minimally different. In conclusion, unsupervised imputation before CV appears valid in certain settings and may be a helpful strategy that enables analysts to use more flexible imputation techniques without incurring high computational costs.
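
The two strategies compared in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' experimental setup: the synthetic data, the mean imputation, the one-predictor least-squares model, and every function name (`make_data`, `fit_mean`, `impute`, `fit_ols`, `r_squared`, `cv_r2`) are hypothetical choices made to keep the example self-contained.

```python
import random
from statistics import mean


def make_data(n=200, missing_rate=0.3, seed=0):
    """Synthetic (x, y) rows; x is replaced by None at random (assumed setup)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = 2.0 * x + rng.gauss(0, 1)
        rows.append((None if rng.random() < missing_rate else x, y))
    return rows


def fit_mean(rows):
    """Unsupervised imputation model: mean of observed x, ignoring the outcome y."""
    return mean(x for x, _ in rows if x is not None)


def impute(rows, fill):
    """Replace every missing x with the given fill value."""
    return [(fill if x is None else x, y) for x, y in rows]


def fit_ols(rows):
    """Simple least-squares line y = intercept + slope * x."""
    mx = mean(x for x, _ in rows)
    my = mean(y for _, y in rows)
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(rows, intercept, slope):
    """R-squared of the fitted line on the given rows."""
    my = mean(y for _, y in rows)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in rows)
    ss_tot = sum((y - my) ** 2 for _, y in rows)
    return 1 - ss_res / ss_tot


def cv_r2(rows, k=5, impute_before=False):
    """Mean k-fold R-squared, imputing either once up front or inside each fold."""
    if impute_before:
        # Strategy 1: fit the imputation model once on all rows, before CV.
        rows = impute(rows, fit_mean(rows))
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        if not impute_before:
            # Strategy 2: refit the imputation model on the training folds only,
            # then apply it to both the training and the held-out fold.
            fill = fit_mean(train)
            train, test = impute(train, fill), impute(test, fill)
        scores.append(r_squared(test, *fit_ols(train)))
    return mean(scores)


data = make_data()
r2_before = cv_r2(data, impute_before=True)
r2_during = cv_r2(data, impute_before=False)
print(f"imputation before CV: R^2 = {r2_before:.3f}")
print(f"imputation during CV: R^2 = {r2_during:.3f}")
```

Mean imputation is cheap, so the computational saving from imputing once is negligible here; the trade-off the abstract describes matters when the imputation step itself is expensive and would otherwise be refit in every replicate.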
