Paper Title
Autoencoding Improves Pre-trained Word Embeddings
Paper Authors
Paper Abstract
Prior work investigating the geometry of pre-trained word embeddings has shown that word embeddings are distributed in a narrow cone, and that by centering and projecting using principal component vectors one can increase the accuracy of a given set of pre-trained word embeddings. However, theoretically, this post-processing step is equivalent to applying a linear autoencoder to minimise the squared ℓ2 reconstruction error. This result contradicts prior work (Mu and Viswanath, 2018) that proposed to remove the top principal components from pre-trained embeddings. We experimentally verify our theoretical claims and show that retaining the top principal components is indeed useful for improving pre-trained word embeddings, without requiring access to additional linguistic resources or labelled data.
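The following is a minimal NumPy sketch, not the authors' code, contrasting the two post-processing variants the abstract describes: removing the top principal components (Mu and Viswanath, 2018) versus centering and projecting onto them, which coincides with the reconstruction of an optimal linear autoencoder under squared ℓ2 loss. The embedding matrix E (one row per word) and the number of components k are illustrative assumptions.

```python
import numpy as np

def pca_components(E, k):
    """Return the mean of E and its top-k principal component vectors."""
    mu = E.mean(axis=0)
    # Rows of Vt are principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(E - mu, full_matrices=False)
    return mu, Vt[:k]

def remove_top_components(E, k):
    """Mu and Viswanath (2018): subtract the projections onto the top-k components."""
    mu, V = pca_components(E, k)
    X = E - mu
    return X - X @ V.T @ V

def retain_top_components(E, k):
    """Centre and project onto the top-k components; this reconstruction is what an
    optimal linear autoencoder with a k-dimensional bottleneck produces under
    squared L2 reconstruction error."""
    mu, V = pca_components(E, k)
    return (E - mu) @ V.T @ V

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    E = rng.normal(size=(1000, 300))  # stand-in for pre-trained embeddings
    print(remove_top_components(E, 100).shape)  # (1000, 300)
    print(retain_top_components(E, 100).shape)  # (1000, 300)
```

The two functions make the disagreement concrete: the first discards the top principal directions, while the second keeps only those directions, which is the behaviour the abstract argues is theoretically justified.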