Paper Title


Word2vec Skip-gram Dimensionality Selection via Sequential Normalized Maximum Likelihood

Paper Authors

Hung, Pham Thuc, Yamanishi, Kenji

Paper Abstract


In this paper, we propose a novel information criteria-based approach to select the dimensionality of the word2vec Skip-gram (SG). From the perspective of probability theory, SG is considered an implicit probability distribution estimation under the assumption that there exists a true contextual distribution among words. Therefore, we apply information criteria with the aim of selecting the best dimensionality so that the corresponding model can be as close as possible to the true distribution. We examine the following information criteria for the dimensionality selection problem: the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), and Sequential Normalized Maximum Likelihood (SNML) criterion. SNML is the total codelength required for the sequential encoding of a data sequence on the basis of the minimum description length. The proposed approach is applied to both the original SG model and the SG Negative Sampling model to clarify the idea of using information criteria. Additionally, as the original SNML suffers from computational disadvantages, we introduce novel heuristics for its efficient computation. Moreover, we empirically demonstrate that SNML outperforms both BIC and AIC. In comparison with other evaluation methods for word embedding, the dimensionality selected by SNML is significantly closer to the optimal dimensionality obtained by word analogy or word similarity tasks.
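For reference, the SNML codelength mentioned above can be sketched using the generic definition of sequential normalized maximum likelihood; the notation here (the sequence x^n, the maximum likelihood estimate \hat{\theta}) is assumed for illustration and is not necessarily the paper's exact formulation:

$$
L_{\mathrm{SNML}}(x^n) \;=\; \sum_{t=1}^{n} -\log \frac{p_{\hat{\theta}(x^t)}\bigl(x_t \mid x^{t-1}\bigr)}{\sum_{x'} p_{\hat{\theta}(x^{t-1}x')}\bigl(x' \mid x^{t-1}\bigr)}
$$

where \hat{\theta}(x^t) denotes the maximum likelihood estimate fitted to the sequence observed up to time t, and the denominator normalizes over all possible next symbols x'. Under this criterion, the embedding dimensionality yielding the smallest total codelength is selected.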
