Paper Title
Intrinsic analysis for dual word embedding space models
Paper Authors
Paper Abstract
Recent word embedding techniques represent words in a continuous vector space, moving away from the atomic and sparse representations of the past. Each such technique can further create multiple varieties of embeddings based on different settings of hyper-parameters such as embedding dimension size, context window size and training method. An additional variety appears when we consider dual embedding space techniques, which generate not one but two word embeddings as output. This gives rise to an interesting question: "is there one, or a combination, of the two word embedding varieties that works better for a specific task?". This paper tries to answer this question by considering all of these variations. Herein, we compare two classical embedding methods belonging to two different methodologies: Word2Vec from the window-based family and GloVe from the count-based family. For an extensive evaluation covering all variations, a total of 84 different models were compared on semantic, association and analogy evaluation tasks built from 9 open-source linguistic datasets. The final results for Word2Vec show a preference for non-default models on 2 out of 3 tasks. In the case of GloVe, non-default models outperform the default on all 3 evaluation tasks.
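The abstract refers to dual embedding space techniques that output two sets of vectors per word. Below is a minimal sketch, assuming gensim >= 4.0 and a toy corpus, of how Word2Vec exposes both the input ("word") and output ("context") embeddings, and one illustrative way of combining them by averaging. The corpus and the averaging rule are assumptions for illustration, not the paper's exact protocol.

```python
# Minimal sketch: accessing the two embedding spaces of a Word2Vec
# model trained with negative sampling (gensim >= 4.0 assumed).
from gensim.models import Word2Vec

# Toy corpus purely for illustration (assumption, not the paper's data).
sentences = [["king", "queen", "man", "woman"],
             ["paris", "france", "berlin", "germany"]]

model = Word2Vec(sentences, vector_size=50, window=5,
                 min_count=1, sg=1, negative=5, epochs=10)

idx = model.wv.key_to_index["king"]
w_in = model.wv.vectors[idx]   # input ("word") embedding
w_out = model.syn1neg[idx]     # output ("context") embedding

# One possible combination a dual-space comparison can evaluate:
w_combined = (w_in + w_out) / 2
```

For GloVe, the analogous pair is the word matrix W and the context matrix W̃; the original GloVe paper combines them by summation (W + W̃), which is one of the variants such a comparison can cover.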