论文标题
Dartminhash:加权集的快速素描
DartMinHash: Fast Sketching for Weighted Sets
论文作者
论文摘要
加权的Minwise Hashing是一种标准维度降低技术,具有相似性搜索和大规模内核机器的应用。我们介绍了一种简单的算法,该算法在\ mathbb {r} _ {\ geq 0}^{d} $中采用加权集$ x \ in \ mathbb {r} _ {\ geq 0}^{d} $,并计算$ k $独立的minhashes in Exirpion $ o(k \ log k + k + k + k + \ vert x \ vert x \ vert x \ vert_ {0} \ log(\ vert_ \ vert x \ vert x \ vert x + vert x + vert x + vert x + vert x + vert \ vert_1)$,在最先进的Bagminhash算法(KDD '18)上改进,并代表最快的加权Minhash算法,用于稀疏数据。我们的实验表明,与ICWS(ICDM '10)和Bagminhash相比,使用$ K $和$ \ vert X \ Vert_0 $的运行时间更好,并在常见用例中获得$ 10 $ X的加速。我们的方法还产生了一种技术,用于计算$(l,k)$的完全独立的区域敏感的哈希值,即在加权jaccard相似性下的参数化近似近似近邻居搜索,即最佳预期时间$ o(lk + \ vert x \ vert_0)$,即使在未级别的工作中也可以改善先前的工作。
Weighted minwise hashing is a standard dimensionality reduction technique with applications to similarity search and large-scale kernel machines. We introduce a simple algorithm that takes a weighted set $x \in \mathbb{R}_{\geq 0}^{d}$ and computes $k$ independent minhashes in expected time $O(k \log k + \Vert x \Vert_{0}\log( \Vert x \Vert_1 + 1/\Vert x \Vert_1))$, improving upon the state-of-the-art BagMinHash algorithm (KDD '18) and representing the fastest weighted minhash algorithm for sparse data. Our experiments show running times that scale better with $k$ and $\Vert x \Vert_0$ compared to ICWS (ICDM '10) and BagMinhash, obtaining $10$x speedups in common use cases. Our approach also gives rise to a technique for computing fully independent locality-sensitive hash values for $(L, K)$-parameterized approximate near neighbor search under weighted Jaccard similarity in optimal expected time $O(LK + \Vert x \Vert_0)$, improving on prior work even in the case of unweighted sets.