Paper Title
Principal Word Vectors
Paper Authors
Abstract
We generalize principal component analysis for embedding words into a vector space. The generalization is made at two major levels. The first is to generalize the concept of a corpus as a counting process defined by three key elements: a vocabulary set, a feature (annotation) set, and a context. This generalization enables the principal word embedding method to generate word vectors with regard to different types of contexts and different types of annotations provided for a corpus. The second is to generalize the transformation step used in most word embedding methods. To this end, we define two levels of transformation. The first is a quadratic transformation, which accounts for different types of weighting over the vocabulary units and contextual features. The second is an adaptive non-linear transformation, which reshapes the data distribution so that it is meaningful to principal component analysis. The effect of these generalizations on the word vectors is studied intrinsically with regard to the spread and the discriminability of the word vectors. We also provide an extrinsic evaluation of the contribution of the principal word vectors to a word similarity benchmark and to the task of dependency parsing. Our experiments conclude with a comparison between the principal word vectors and other sets of word vectors generated with popular word embedding methods. The results obtained from our intrinsic evaluation metrics show that the spread and the discriminability of the principal word vectors are higher than those of the other word embedding methods. The results obtained from the extrinsic evaluation metrics show that the principal word vectors are better than some word embedding methods and on par with popular word embedding methods.
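The pipeline described in the abstract (counting process over a corpus, a transformation step, then principal component analysis) can be sketched as follows. This is a minimal illustration, not the authors' exact method: the window size, the square-root transformation standing in for the adaptive non-linear step, and the tiny corpus are all assumptions made for the example.

```python
import numpy as np

# Toy corpus; in the paper's terms, the vocabulary set and the context
# (here a symmetric window of size 1) define the counting process.
corpus = ["the cat sat on the mat", "the dog sat on the log"]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Counting process: word-context co-occurrence matrix.
C = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                C[index[w], index[sent[j]]] += 1

# Stand-in for the non-linear transformation: a square root, a common
# variance-stabilizing choice, followed by centering so that PCA applies.
X = np.sqrt(C)
X -= X.mean(axis=0)

# Principal word vectors: project onto the top-k left singular vectors.
k = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)
word_vectors = U[:, :k] * S[:k]
print(word_vectors.shape)  # one k-dimensional vector per vocabulary word
```

The quadratic weighting over vocabulary units and contextual features mentioned in the abstract would correspond to multiplying `X` by diagonal weight matrices on either side before the SVD; it is omitted here to keep the sketch short.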