论文标题
通过K-均值聚类有效而稀疏的计数 - 佐剂
Effective and Sparse Count-Sketch via k-means clustering
论文作者
论文摘要
Count-Sketch是一种流行的矩阵草图算法,它可以在O(nnz(x))时间中产生输入数据矩阵X的草图,其中NNZ(x)表示X中的非零条目的数量。绘制的矩阵将比x x小得多,同时保留其大部分属性。因此,计数 - 凯奇被广泛用于应对机器学习中的高维度挑战。但是,计数 - 勾钉有两个主要局限性:(1)素描矩阵所用的count-ketch是随机生成的,这不考虑X的任何内在数据属性。这种合理的数据构成矩阵素描方法可能会产生不良的素描矩阵,这将导致较低的精确度对于后续的机器学习任务(例如,cellassification); (2)对于高度稀疏的输入数据,计数 - 佐剂可以产生密集的草图数据矩阵。这个密集的草图矩阵可以使后续的机器学习任务在计算上比原始稀疏数据x更昂贵。要解决这两个限制,我们首先通过分析Count-Sketch方法的重建误差来显示Count-Sketch和K-Means聚类之间的有趣连接。基于我们的分析,我们建议通过使用K-均值聚类算法来获得低维素描矩阵来减少计数 - 熟孔的重建误差。此外,我们建议使用带有-l1球投影的梯度下降来求解K-均值聚类,以产生稀疏的素描矩阵。我们基于六个现实生活分类数据集的实验结果表明,我们所提出的方法的准确性比原始的计数 - 凯奇和其他流行的矩阵素描算法更高。我们的结果还表明,我们的方法比其他方法产生一个更稀疏的草图数据矩阵,因此我们方法的预测成本将比其他矩阵素描方法小。
Count-sketch is a popular matrix sketching algorithm that can produce a sketch of an input data matrix X in O(nnz(X))time where nnz(X) denotes the number of non-zero entries in X. The sketched matrix will be much smaller than X while preserving most of its properties. Therefore, count-sketch is widely used for addressing high-dimensionality challenge in machine learning. However, there are two main limitations of count-sketch: (1) The sketching matrix used count-sketch is generated randomly which does not consider any intrinsic data properties of X. This data-oblivious matrix sketching method could produce a bad sketched matrix which will result in low accuracy for subsequent machine learning tasks (e.g.classification); (2) For highly sparse input data, count-sketch could produce a dense sketched data matrix. This dense sketch matrix could make the subsequent machine learning tasks more computationally expensive than on the original sparse data X. To address these two limitations, we first show an interesting connection between count-sketch and k-means clustering by analyzing the reconstruction error of the count-sketch method. Based on our analysis, we propose to reduce the reconstruction error of count-sketch by using k-means clustering algorithm to obtain the low-dimensional sketched matrix. In addition, we propose to solve k-mean clustering using gradient descent with -L1 ball projection to produce a sparse sketched matrix. Our experimental results based on six real-life classification datasets have demonstrated that our proposed method achieves higher accuracy than the original count-sketch and other popular matrix sketching algorithms. Our results also demonstrate that our method produces a sparser sketched data matrix than other methods and therefore the prediction cost of our method will be smaller than other matrix sketching methods.