Diversity Subsampling: Custom Subsamples from Large Data Sets

Paper Title

Diversity Subsampling: Custom Subsamples from Large Data Sets

Authors

Boyang Shang, Daniel W. Apley, Sanjay Mehrotra

Abstract

Subsampling from a large data set is useful in many supervised learning contexts to provide a global view of the data based on only a fraction of the observations. Diverse (or space-filling) subsampling is an appealing subsampling approach when no prior knowledge of the data is available. In this paper, we propose a diversity subsampling approach that selects a subsample from the original data such that the subsample is independently and uniformly distributed over the support of the distribution from which the data are drawn, to the maximum extent possible. We give an asymptotic performance guarantee of the proposed method and provide experimental results to show that the proposed method performs well for typical finite-size data. We also compare the proposed method with competing diversity subsampling algorithms and demonstrate numerically that subsamples selected by the proposed method are closer to a uniform sample than subsamples selected by other methods. The proposed DS algorithm is shown to be more efficient than known methods and takes only a few minutes to select tens of thousands of subsample points from a data set of size one million. Our DS algorithm easily generalizes to select subsamples following distributions other than uniform. We provide the FADS Python package to implement the proposed methods.
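
To make the idea of a subsample that is roughly uniform over the data's support more concrete, the following is a minimal sketch and not the authors' DS algorithm or the FADS package API. It assumes a crude histogram density estimate as a stand-in for whatever estimator the paper uses, and draws points without replacement with probability inversely proportional to that estimate. The function name `diversity_subsample_sketch` and the `bins` parameter are hypothetical, introduced only for illustration.

```python
import numpy as np


def diversity_subsample_sketch(X, m, bins=20, seed=None):
    """Return indices of m rows of X, favoring low-density regions.

    Illustrative only: density is approximated by counts on a regular
    grid (an assumption of this sketch, not the paper's estimator).
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if m > n:
        raise ValueError("subsample size m cannot exceed the number of rows")

    # Map every point to a cell of a regular grid over the bounding box.
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    cell = np.floor((X - lo) / span * bins).astype(int)
    cell = np.clip(cell, 0, bins - 1)       # points at the max go to the last cell
    flat = np.ravel_multi_index(tuple(cell.T), (bins,) * d)

    # The number of points sharing a cell is proportional to the local density.
    counts = np.bincount(flat, minlength=bins ** d)
    weights = 1.0 / counts[flat]            # inverse-density weights, counts >= 1 here
    p = weights / weights.sum()

    # Weighted sampling without replacement.
    rng = np.random.default_rng(seed)
    return rng.choice(n, size=m, replace=False, p=p)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(5000, 2))       # heavily concentrated near the origin
    picked = diversity_subsample_sketch(data, 200, seed=1)
    print("std of full data:", data.std())
    print("std of subsample:", data[picked].std())
```

Because every occupied grid cell receives the same total selection weight, sparse regions of the support are represented about as often as dense ones. A grid like this scales poorly with dimension, so it serves only to illustrate the inverse-density idea.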
