Paper Title
Anchor-based Bilingual Word Embeddings for Low-Resource Languages
Authors
Abstract
Good-quality monolingual word embeddings (MWEs) can be built for languages that have large amounts of unlabeled text. MWEs can be aligned to bilingual spaces using only a few thousand word translation pairs. For low-resource languages, training MWEs monolingually results in MWEs of poor quality, and thus in poor bilingual word embeddings (BWEs) as well. This paper proposes a new approach for building BWEs in which the vector space of the high-resource source language is used as a starting point for training an embedding space for the low-resource target language. By using the source vectors as anchors, the vector spaces are automatically aligned during training. We experiment on English-German, English-Hiligaynon and English-Macedonian. We show that our approach results not only in improved BWEs and bilingual lexicon induction performance, but also in improved target-language MWE quality as measured using monolingual word similarity.
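The anchoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how anchors are handled during training (for example, whether they are frozen or updated), so the code only shows one plausible initialisation step, followed by bilingual lexicon induction via cosine nearest neighbour. The function names `init_target_embeddings` and `nearest_source_word`, the toy vocabulary, and the three-dimensional vectors are all hypothetical, and the embedding training itself is omitted.

```python
import numpy as np


def init_target_embeddings(src_vecs, seed_pairs, tgt_vocab, dim, rng):
    """Initialise target-language vectors (hypothetical helper).

    Target words that appear in the seed lexicon start from a copy of
    their source-language translation's vector (the "anchors"); all other
    target words start from small random values.
    """
    tgt_vecs = {w: rng.normal(scale=0.1, size=dim) for w in tgt_vocab}
    anchors = set()
    for src_word, tgt_word in seed_pairs:
        if src_word in src_vecs and tgt_word in tgt_vecs:
            tgt_vecs[tgt_word] = src_vecs[src_word].copy()
            anchors.add(tgt_word)
    return tgt_vecs, anchors


def nearest_source_word(tgt_vec, src_vecs):
    """Return the source word whose vector has the highest cosine
    similarity to tgt_vec (a simple form of lexicon induction)."""
    best_word, best_sim = None, -np.inf
    for word, vec in src_vecs.items():
        sim = float(vec @ tgt_vec) / (np.linalg.norm(vec) * np.linalg.norm(tgt_vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word


# Toy data (assumed for illustration only).
src_vecs = {
    "dog": np.array([1.0, 0.0, 0.0]),
    "cat": np.array([0.0, 1.0, 0.0]),
    "house": np.array([0.0, 0.0, 1.0]),
}
seed_pairs = [("dog", "Hund"), ("cat", "Katze")]
tgt_vocab = ["Hund", "Katze", "Haus"]

rng = np.random.default_rng(0)
tgt_vecs, anchors = init_target_embeddings(src_vecs, seed_pairs, tgt_vocab, dim=3, rng=rng)

# An anchored target word retrieves its own translation.
translation = nearest_source_word(tgt_vecs["Hund"], src_vecs)
```

In a full pipeline, the target embeddings would then be trained on target-language text starting from `tgt_vecs`, so that non-anchor words such as "Haus" settle into positions consistent with the source space; that training step is outside the scope of this sketch.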