Title
Asymptotics of Cross-Validation
Authors
Abstract
Cross-validation is a central tool in evaluating the performance of machine learning and statistical models. However, despite its ubiquitous role, its theoretical properties are still not well understood. We study the asymptotic properties of the cross-validated risk for a large class of models. Under stability conditions, we establish a central limit theorem and Berry-Esseen bounds, which enable us to compute asymptotically accurate confidence intervals. Using our results, we paint a big picture for the statistical speed-up of cross-validation compared to a train-test split procedure. A corollary of our results is that parametric M-estimators (or empirical risk minimizers) benefit from the "full" speed-up when performing cross-validation under the training loss. In other common cases, such as when the training is done using a surrogate loss or a regularizer, we show that the behavior of the cross-validated risk is complex, with a variance reduction that may be smaller or larger than the "full" speed-up, depending on the model and the underlying distribution. We allow the number of folds to grow with the number of observations at any rate.
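The abstract's claim that a central limit theorem enables asymptotically accurate confidence intervals can be illustrated with a short sketch. This is a minimal toy example, not the paper's estimator: it pools per-example losses from K folds and forms a plain normal-approximation interval around the cross-validated risk; the function name and synthetic data are assumptions for illustration only.

```python
import math
import random
import statistics

def cv_risk_with_ci(losses_per_fold, alpha=0.05):
    """Pool per-example losses from K folds and return the cross-validated
    risk together with a normal-approximation (1 - alpha) confidence
    interval. The CLT justifying this interval is the subject of the paper;
    this helper only illustrates the resulting computation."""
    losses = [loss for fold in losses_per_fold for loss in fold]
    n = len(losses)
    risk = statistics.fmean(losses)
    se = statistics.stdev(losses) / math.sqrt(n)  # naive plug-in standard error
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)  # e.g. ~1.96 for alpha=0.05
    return risk, (risk - z * se, risk + z * se)

# Toy usage: squared-error losses from 5 folds of a synthetic problem.
random.seed(0)
folds = [[random.gauss(1.0, 0.3) ** 2 for _ in range(40)] for _ in range(5)]
risk, (lo, hi) = cv_risk_with_ci(folds)
print(f"CV risk: {risk:.3f}, 95% CI: ({lo:.3f}, {hi:.3f})")
```

Note that the naive pooled standard error above ignores the dependence between folds induced by overlapping training sets; quantifying when (and by how much) this matters is precisely the kind of question the paper's Berry-Esseen analysis addresses.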