Paper Title
Model-free Subsampling Method Based on Uniform Designs
Authors
Abstract
Subsampling, or subdata selection, is a useful approach in large-scale statistical learning. Most existing studies focus on model-based subsampling methods, which depend heavily on the model assumption. In this paper, we consider a model-free subsampling strategy for generating subdata from the original full data. To measure how well a subdata set represents the original data, we propose a criterion, the generalized empirical F-discrepancy (GEFD), and study its theoretical properties in connection with the classical generalized L2-discrepancy in the theory of uniform designs. These properties allow us to develop a low-GEFD, data-driven subsampling method based on existing uniform designs. Through simulation examples and a real case study, we show that the proposed subsampling method is superior to random sampling. Moreover, our method remains robust under diverse model specifications, whereas other popular subsampling methods underperform. In practice, this model-free property is more appealing than model-based subsampling, which may perform poorly when the model is misspecified, as demonstrated in our simulation studies.
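The abstract does not spell out the algorithm, so the following is only a rough illustration of the general idea of design-based, model-free subsampling, not the paper's procedure. It makes several assumptions that are not in the source: each variable is mapped to the unit interval through its marginal empirical distribution (a rank transform), a Halton sequence stands in for a uniform design, and each design point is matched to its nearest unused data row. The function names `halton`, `rank_transform`, and `design_based_subsample` are hypothetical, and the sketch does not compute the GEFD.

```python
import random

def halton(n, dim):
    """First n points of the Halton sequence in [0, 1)^dim.
    Assumption: used here as a stand-in for a uniform design."""
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
    if dim > len(primes):
        raise ValueError("this sketch supports at most 10 dimensions")

    def radical_inverse(i, base):
        # Reflect the base-`base` digits of i about the radix point.
        x, f = 0.0, 1.0 / base
        while i > 0:
            x += (i % base) * f
            i //= base
            f /= base
        return x

    return [[radical_inverse(i, primes[j]) for j in range(dim)]
            for i in range(1, n + 1)]

def rank_transform(data):
    """Map each column to (0, 1) via its marginal ranks: (rank + 0.5) / N."""
    n, dim = len(data), len(data[0])
    u = [[0.0] * dim for _ in range(n)]
    for j in range(dim):
        order = sorted(range(n), key=lambda i: data[i][j])
        for r, i in enumerate(order):
            u[i][j] = (r + 0.5) / n
    return u

def design_based_subsample(data, n_sub):
    """Return indices of n_sub distinct rows of `data`.
    Each design point picks its nearest not-yet-chosen row
    (squared Euclidean distance in the rank-transformed space)."""
    u = rank_transform(data)
    design = halton(n_sub, len(data[0]))
    available = set(range(len(data)))
    chosen = []
    for p in design:
        best = min(
            available,
            key=lambda i: (sum((a - b) ** 2 for a, b in zip(u[i], p)), i),
        )
        chosen.append(best)
        available.remove(best)
    return chosen

# Usage: pick 50 rows from 2000 synthetic two-dimensional observations.
rng = random.Random(0)
full = [[rng.gauss(0, 1), rng.expovariate(1.0)] for _ in range(2000)]
idx = design_based_subsample(full, 50)
subdata = [full[i] for i in idx]
```

No response variable or model enters the selection, which is the sense in which such a scheme is model-free. The brute-force nearest-neighbour search costs on the order of N times n distance evaluations and is kept only for readability.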