Paper title
Minimum discrepancy principle strategy for choosing $k$ in $k$-NN regression
Paper authors
Paper abstract
We present a novel data-driven strategy for choosing the hyperparameter $k$ in the $k$-NN regression estimator without using any hold-out data. We treat the choice of the hyperparameter as an iterative procedure (over $k$) and propose a strategy, easy to implement in practice, based on the idea of early stopping and the minimum discrepancy principle. This model selection strategy is proven to be minimax-optimal over some smoothness function classes, for instance, the class of Lipschitz functions on a bounded domain. The new method often improves statistical performance on artificial and real-world data sets compared to other model selection strategies, such as the hold-out method, 5-fold cross-validation, and the AIC criterion. The novelty of the strategy comes from reducing the computational time of the model selection procedure while preserving the statistical (minimax) optimality of the resulting estimator. More precisely, given a sample of size $n$, if one has to choose $k$ among $\left\{ 1, \ldots, n \right\}$, and $\left\{ f^1, \ldots, f^n \right\}$ are the corresponding estimators of the regression function, the minimum discrepancy principle requires computing only a fraction of these estimators, whereas this is not the case for generalized cross-validation, Akaike's AIC criterion, or the Lepskii principle.
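To make the early-stopping idea concrete, here is a minimal sketch of a stopping rule of the kind the abstract describes, under the assumption that $k$ is increased from 1 and the iteration stops the first time the empirical (in-sample) risk of the $k$-NN estimator reaches a given noise level $\sigma^2$. The function name `choose_k_mdp`, the use of scikit-learn's `KNeighborsRegressor`, and treating $\sigma^2$ as known are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor


def choose_k_mdp(X, y, sigma2, k_max=None):
    """Illustrative early-stopping rule over k (hypothetical sketch).

    Iterates k = 1, 2, ... and returns the first k whose in-sample
    empirical risk reaches the noise level sigma2, so that only the
    estimators f^1, ..., f^{k_hat} are ever fitted.
    """
    n = len(y)
    k_max = k_max or n
    for k in range(1, k_max + 1):
        # Fit the k-NN regression estimator f^k on the full sample.
        f_k = KNeighborsRegressor(n_neighbors=k).fit(X, y)
        # Empirical (training) risk of f^k.
        emp_risk = np.mean((y - f_k.predict(X)) ** 2)
        # Minimum discrepancy: stop once the residual matches the noise level.
        if emp_risk >= sigma2:
            return k
    return k_max
```

In practice $\sigma^2$ is unknown and must be estimated from the data; the sketch only illustrates the computational point made above, namely that the loop terminates after fitting a fraction of the $n$ candidate estimators, in contrast to generalized cross-validation or AIC, which evaluate all of them.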