Paper Title
On Coresets for Support Vector Machines
Paper Authors
Paper Abstract
We present an efficient coreset construction algorithm for large-scale Support Vector Machine (SVM) training in Big Data and streaming applications. A coreset is a small, representative subset of the original data points such that models trained on the coreset are provably competitive with those trained on the original data set. Since the size of the coreset is generally much smaller than that of the original set, our preprocess-then-train scheme has the potential to yield significant speedups when training SVM models. We prove lower and upper bounds on the coreset size required to obtain small data summaries for the SVM problem. As a corollary, we show that our algorithm can be used to extend the applicability of any off-the-shelf SVM solver to streaming, distributed, and dynamic data settings. We evaluate the performance of our algorithm on real-world and synthetic data sets. Our experimental results reaffirm the favorable theoretical properties of our algorithm and demonstrate its practical effectiveness in accelerating SVM training.
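The preprocess-then-train scheme described above can be sketched in a few lines. The sketch below is illustrative only: the paper's algorithm derives sensitivity-based importance-sampling probabilities, whereas here we substitute a uniform placeholder distribution `p` just to show the pipeline (sample a weighted coreset, then train any SVM solver on it). The Pegasos-style subgradient trainer is likewise a stand-in for an off-the-shelf solver; all names and parameters are assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-class data: two well-separated Gaussian blobs.
n = 2000
X = np.vstack([rng.normal(-2, 1, (n // 2, 2)), rng.normal(2, 1, (n // 2, 2))])
y = np.hstack([-np.ones(n // 2), np.ones(n // 2)])

def train_linear_svm(X, y, sample_weights, lam=0.01, epochs=50):
    """Weighted linear SVM via Pegasos-style stochastic subgradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            t += 1
            eta = 1.0 / (lam * t)          # standard Pegasos step size
            margin = y[i] * (X[i] @ w + b)
            w *= (1 - eta * lam)           # shrink from the L2 regularizer
            if margin < 1:                 # subgradient of the weighted hinge loss
                w += eta * sample_weights[i] * y[i] * X[i]
                b += eta * sample_weights[i] * y[i]
    return w, b

# --- Coreset construction (placeholder): the paper uses sensitivity-based
# sampling probabilities; uniform probabilities are used here only to
# illustrate the sample-and-reweight structure. ---
m = 100                                    # coreset size, much smaller than n
p = np.full(n, 1.0 / n)                    # placeholder sampling distribution
idx = rng.choice(n, size=m, replace=True, p=p)
weights = 1.0 / (m * p[idx])               # unbiased importance weights

w_full, b_full = train_linear_svm(X, y, np.ones(n))
w_core, b_core = train_linear_svm(X[idx], y[idx], weights)

acc = lambda w, b: np.mean(np.sign(X @ w + b) == y)
print(f"full-data accuracy: {acc(w_full, b_full):.3f}")
print(f"coreset accuracy:   {acc(w_core, b_core):.3f}")
```

On data this cleanly separable, the model trained on the 100-point weighted coreset closely matches the full-data model, which is the speedup-versus-accuracy trade-off the abstract refers to; the theoretical guarantees, of course, rest on the sensitivity-based probabilities rather than this uniform stand-in.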