Paper Title

SOLAR: Sparse Orthogonal Learned and Random Embeddings

Paper Authors

Tharun Medini, Beidi Chen, Anshumali Shrivastava

Paper Abstract

Dense embedding models are commonly deployed in commercial search engines, wherein all the document vectors are pre-computed, and near-neighbor search (NNS) is performed with the query vector to find relevant documents. However, the bottleneck of indexing a large number of dense vectors and performing NNS hurts the query time and accuracy of these models. In this paper, we argue that high-dimensional and ultra-sparse embeddings are a significantly superior alternative to dense low-dimensional embeddings for both query efficiency and accuracy. Extreme sparsity eliminates the need for NNS by replacing it with simple lookups, while the high dimensionality ensures that the embeddings remain informative even when sparse. However, learning extremely high-dimensional embeddings leads to a blow-up in the model size. To make training feasible, we propose a partitioning algorithm that learns such high-dimensional embeddings across multiple GPUs without any communication. This is facilitated by our novel asymmetric mixture of Sparse, Orthogonal, Learned and Random (SOLAR) embeddings. The label vectors are random, sparse, and near-orthogonal by design, while the query vectors are learned and sparse. We theoretically prove that our way of one-sided learning is equivalent to learning both query and label embeddings. With these unique properties, we successfully train 500K-dimensional SOLAR embeddings for the tasks of searching through 1.6M books and multi-label classification on the three largest public datasets. We achieve superior precision and recall compared to the respective state-of-the-art baselines for each task, with up to 10x faster speed.
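To make the core idea concrete, below is a minimal Python sketch (not the paper's implementation) of how random, sparse, near-orthogonal label vectors allow retrieval by simple inverted-index lookups instead of near-neighbor search. The dimensionality D = 500,000 matches the abstract; the sparsity level K, the toy label count, and all function names are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

D = 500_000  # embedding dimensionality, as stated in the abstract
K = 32       # nonzeros per label vector (assumed for illustration)

rng = np.random.default_rng(0)

# Each label gets K random nonzero coordinates out of D. With K << D,
# two labels share a given coordinate with probability roughly K^2 / D,
# so label vectors are near-orthogonal by construction -- no training needed.
labels = {l: rng.choice(D, size=K, replace=False) for l in range(1_000)}

# Inverted index: coordinate -> labels active at that coordinate.
inverted = defaultdict(list)
for label, coords in labels.items():
    for c in coords:
        inverted[c].append(label)

def retrieve(query_coords, top_k=5):
    """Replace near-neighbor search with simple lookups: score each label
    by how many of the query's active coordinates it shares."""
    scores = defaultdict(int)
    for c in query_coords:
        for label in inverted[c]:
            scores[label] += 1
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# A query whose learned sparse embedding fires on the same coordinates as
# label 42 retrieves label 42 by lookups alone.
print(retrieve(labels[42]))
```

Per the abstract, only the query-side embedding is learned in SOLAR (partitioned across GPUs with no communication), while the label side stays fixed and random as sketched above; one-sided learning is shown to be equivalent to learning both sides.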
