Paper Title
Predicting into unknown space? Estimating the area of applicability of spatial prediction models
Paper Authors
Paper Abstract
Predictive modelling using machine learning has become very popular for spatial mapping of the environment. Models are often applied to make predictions far beyond sampling locations, where new geographic locations may differ considerably from the training data in their environmental properties. However, areas in the predictor space that are not supported by training data are problematic. Since the model has no knowledge of these environments, predictions there have to be considered uncertain. Estimating the area to which a prediction model can be reliably applied is therefore required. Here, we suggest a methodology that delineates the "area of applicability" (AOA), which we define as the area for which the cross-validation error of the model applies. We first propose a "dissimilarity index" (DI) based on the minimum distance to the training data in the predictor space, with predictors weighted by their respective importance in the model. The AOA is then derived by applying a threshold based on the DI of the training data, where the DI is calculated with respect to the cross-validation strategy used for model training. We test for the ideal threshold using simulated data and compare the prediction error within the AOA with the cross-validation error of the model. We illustrate the approach using a simulated case study. Our simulation study suggests defining the AOA by thresholding the DI at the 0.95 quantile of the DI in the training data. Using this threshold, the prediction error within the AOA is comparable to the cross-validation RMSE of the model, while the cross-validation error does not apply outside the AOA. This holds for models trained with randomly distributed training data, as well as when training data are clustered in space and spatial cross-validation is applied. We suggest reporting the AOA alongside predictions, complementary to validation measures.
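
As a concrete illustration of the DI/AOA construction described in the abstract, the following minimal Python sketch computes a weighted minimum-distance dissimilarity index and derives an AOA mask by thresholding at the 0.95 quantile of the training DI. It is not the authors' reference implementation: the function names, the standardisation and distance scaling, and the use of leave-one-out distances for the training DI (in place of distances computed with respect to the cross-validation folds) are simplifying assumptions of this sketch.

```python
import numpy as np


def dissimilarity_index(train_X, new_X, weights):
    """Weighted minimum distance of each new point to the training data in
    predictor space, scaled by the mean pairwise distance among training
    points (the scaling is an assumption of this sketch)."""
    # Standardise predictors, then weight them by their importance in the model.
    mean, std = train_X.mean(axis=0), train_X.std(axis=0)
    tw = (train_X - mean) / std * weights
    nw = (new_X - mean) / std * weights

    # Minimum Euclidean distance of each new point to any training point.
    d_min = np.sqrt(((nw[:, None, :] - tw[None, :, :]) ** 2).sum(axis=2)).min(axis=1)

    # Average pairwise distance within the (weighted) training data.
    d_train = np.sqrt(((tw[:, None, :] - tw[None, :, :]) ** 2).sum(axis=2))
    d_bar = d_train[np.triu_indices(len(tw), k=1)].mean()
    return d_min / d_bar


def aoa_mask(train_X, new_X, weights, quantile=0.95):
    """AOA: new locations whose DI does not exceed the chosen quantile of the
    DI observed within the training data (leave-one-out simplification,
    corresponding to random cross-validation)."""
    di_new = dissimilarity_index(train_X, new_X, weights)
    di_train = np.array([
        dissimilarity_index(np.delete(train_X, i, axis=0),
                            train_X[i:i + 1], weights)[0]
        for i in range(len(train_X))
    ])
    return di_new <= np.quantile(di_train, quantile)
```

In practice, the weights would be taken from the variable importance of the fitted model, and for spatially clustered training data the training DI would be computed against points outside each point's cross-validation fold, as the abstract describes; predictions would then be reported only where the mask is True, alongside the model's cross-validation error.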