Paper Title

On Extending NLP Techniques from the Categorical to the Latent Space: KL Divergence, Zipf's Law, and Similarity Search

Authors

Adam Hare, Yu Chen, Yinan Liu, Zhenming Liu, Christopher G. Brinton

Abstract

Despite the recent successes of deep learning in natural language processing (NLP), there remains widespread usage of and demand for techniques that do not rely on machine learning. The advantages of these techniques are their interpretability and low cost when compared to frequently opaque and expensive machine learning models. Although they may not be as performant in all cases, they are often sufficient for common and relatively simple problems. In this paper, we aim to modernize these older methods while retaining their advantages by extending approaches from categorical or bag-of-words representations to word embedding representations in the latent space. First, we show that entropy and Kullback-Leibler divergence can be efficiently estimated using word embeddings and use this estimation to compare text across several categories. Next, we recast the heavy-tailed distribution known as Zipf's law, frequently observed in the categorical space, to the latent space. Finally, we look to improve the Jaccard similarity measure for sentence suggestion by introducing a new method of identifying similar sentences based on the set cover problem. We compare the performance of this algorithm against several baselines including Word Mover's Distance and the Levenshtein distance.
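
To make the first contribution concrete: KL divergence between two corpora can be estimated directly from their word-embedding samples, without binning, for example with a classical nonparametric k-nearest-neighbor estimator. The sketch below illustrates that general approach, not the paper's exact procedure; the function name knn_kl_divergence and the default k=1 are assumptions made for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_kl_divergence(x, y, k=1):
    """Estimate KL(P || Q) from samples x ~ P and y ~ Q.

    x: (n, d) array of word embeddings drawn from P
    y: (m, d) array of word embeddings drawn from Q
    Classical k-NN estimator: compare each point's k-th neighbor
    distance within x to its k-th neighbor distance in y.
    Sketch for illustration, not the paper's estimator.
    """
    n, d = x.shape
    m = y.shape[0]
    # k-th nearest neighbor of each x_i among the other points of x
    # (ask for k+1 neighbors so the point itself can be skipped).
    rho = cKDTree(x).query(x, k=k + 1)[0][:, -1]
    # k-th nearest neighbor of each x_i among the points of y.
    nu = cKDTree(y).query(x, k=k)[0]
    if k > 1:
        nu = nu[:, -1]
    return (d / n) * np.sum(np.log(nu / rho)) + np.log(m / (n - 1))
```

The set-cover-based sentence search can likewise be pictured as greedily covering a query sentence's tokens with candidate sentences. This is only a schematic reading of the abstract, not the paper's algorithm; greedy_cover_candidates is a hypothetical helper.

```python
def greedy_cover_candidates(query_tokens, candidates):
    """Greedily pick candidate sentences (token lists) that cover the
    most still-uncovered tokens of the query. Schematic greedy
    set-cover heuristic, not the paper's exact method."""
    uncovered = set(query_tokens)
    chosen = []
    while uncovered and candidates:
        best = max(candidates, key=lambda s: len(uncovered & set(s)))
        gain = uncovered & set(best)
        if not gain:
            break  # no candidate covers any remaining token
        chosen.append(best)
        uncovered -= gain
    return chosen
```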
