Paper Title

You can't pick your neighbors, or can you? When and how to rely on retrieval in the $k$NN-LM

Paper Authors

Andrew Drozdov, Shufan Wang, Razieh Rahimi, Andrew McCallum, Hamed Zamani, Mohit Iyyer

Paper Abstract

Retrieval-enhanced language models (LMs), which condition their predictions on text retrieved from large external datastores, have recently shown significant perplexity improvements compared to standard LMs. One such approach, the $k$NN-LM, interpolates any existing LM's predictions with the output of a $k$-nearest neighbors model and requires no additional training. In this paper, we explore the importance of lexical and semantic matching in the context of items retrieved by the $k$NN-LM. We find two trends: (1) the presence of large overlapping $n$-grams between the datastore and evaluation set is an important factor in strong performance, even when the datastore is derived from the training data; and (2) the $k$NN-LM is most beneficial when retrieved items have high semantic similarity with the query. Based on our analysis, we define a new formulation of the $k$NN-LM that uses retrieval quality to assign the interpolation coefficient. We empirically measure the effectiveness of our approach on two English language modeling datasets, Wikitext-103 and PG-19. Our re-formulation of the $k$NN-LM is beneficial in both cases, and leads to a nearly 4% improvement in perplexity on the Wikitext-103 test set.
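For readers unfamiliar with the mechanism the abstract describes: the standard $k$NN-LM mixes the base LM's next-token distribution with a distribution induced by the $k$ retrieved neighbors, $p(y \mid x) = \lambda\, p_{\mathrm{kNN}}(y \mid x) + (1-\lambda)\, p_{\mathrm{LM}}(y \mid x)$, where $\lambda$ is normally a fixed hyperparameter. The paper's reformulation makes the coefficient depend on retrieval quality. The sketch below is only an illustration of that idea, assuming neighbor distance as a stand-in for retrieval quality; the function names (`knn_lm_interpolate`, `adaptive_coefficient`) and the parameters `temperature` and `max_coeff` are hypothetical and not taken from the paper.

```python
import numpy as np

def knn_lm_interpolate(p_lm, p_knn, coeff):
    """Standard kNN-LM mixture: interpolate the base LM's next-token
    distribution with the distribution induced by the retrieved neighbors."""
    return coeff * p_knn + (1.0 - coeff) * p_lm

def adaptive_coefficient(neighbor_distances, temperature=1.0, max_coeff=0.5):
    """Illustrative retrieval-quality gate (an assumption, not the paper's
    exact formulation): queries whose nearest neighbor is semantically close
    (small distance) rely more on the retrieval distribution."""
    quality = np.exp(-np.min(neighbor_distances) / temperature)  # in (0, 1]
    return max_coeff * quality

# Toy example over a 4-token vocabulary.
p_lm = np.array([0.5, 0.2, 0.2, 0.1])    # base LM distribution
p_knn = np.array([0.1, 0.7, 0.1, 0.1])   # distribution from retrieved neighbors
distances = np.array([0.3, 0.8, 1.2])    # distances of the k retrieved keys

coeff = adaptive_coefficient(distances)
p = knn_lm_interpolate(p_lm, p_knn, coeff)
print(coeff, p, p.sum())  # the mixture remains a valid probability distribution
```

With a fixed coefficient the gate is skipped and `coeff` is simply a constant such as 0.25; the paper's contribution, per the abstract, is choosing this value per query from retrieval quality rather than globally.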
