Paper Title

Discovering Bilingual Lexicons in Polyglot Word Embeddings

Paper Authors

Ashiqur R. KhudaBukhsh, Shriphani Palakodety, Tom M. Mitchell

Paper Abstract

Bilingual lexicons and phrase tables are critical resources for modern Machine Translation systems. Although recent results show that without any seed lexicon or parallel data, highly accurate bilingual lexicons can be learned using unsupervised methods, such methods rely on the existence of large, clean monolingual corpora. In this work, we utilize a single Skip-gram model trained on a multilingual corpus yielding polyglot word embeddings, and present a novel finding that a surprisingly simple constrained nearest-neighbor sampling technique in this embedding space can retrieve bilingual lexicons, even in harsh social media data sets predominantly written in English and Romanized Hindi and often exhibiting code switching. Our method does not require monolingual corpora, seed lexicons, or any other such resources. Additionally, across three European language pairs, we observe that polyglot word embeddings indeed learn a rich semantic representation of words and substantial bilingual lexicons can be retrieved using our constrained nearest neighbor sampling. We investigate potential reasons and downstream applications in settings spanning both clean texts and noisy social media data sets, and in both resource-rich and under-resourced language pairs.
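To make the retrieval step described in the abstract concrete, the following is a minimal sketch in Python, assuming gensim's Skip-gram implementation. The toy corpus, the hindi_vocab wordlist, and the is_target_language / constrained_nearest_neighbor helpers are illustrative assumptions for this sketch, not the paper's released implementation.

```python
# Minimal sketch (assumed setup, not the paper's released code):
# 1) train ONE Skip-gram model on a mixed English / Romanized-Hindi corpus,
#    yielding "polyglot" embeddings in a single shared vector space;
# 2) retrieve a translation candidate for a source word by taking its
#    nearest neighbors and keeping only words from the target language.

from gensim.models import Word2Vec

# Toy multilingual corpus (placeholder); the paper works with large, noisy
# social media text, not a handful of sentences.
corpus = [
    ["the", "water", "is", "clean"],
    ["drink", "clean", "water"],
    ["paani", "saaf", "hai"],
    ["saaf", "paani", "piyo"],
]

# A single Skip-gram model (sg=1) trained on the combined corpus.
model = Word2Vec(corpus, vector_size=50, sg=1, min_count=1, window=5, epochs=100)

# Placeholder word-level language identifier: membership in a target-language
# wordlist; any word-level LID method could be substituted here.
hindi_vocab = {"paani", "saaf", "hai", "piyo"}

def is_target_language(word: str) -> bool:
    return word in hindi_vocab

def constrained_nearest_neighbor(source_word: str, topn: int = 50):
    """Nearest neighbor of `source_word` restricted to the target language."""
    for neighbor, score in model.wv.most_similar(source_word, topn=topn):
        if is_target_language(neighbor):
            return neighbor, score
    return None  # no target-language word among the top-`topn` neighbors

# On a realistically sized corpus this would ideally retrieve "paani".
print(constrained_nearest_neighbor("water"))
```

Intuitively, the language constraint is what does the work here: in a shared polyglot space, a word's closest neighbors tend to be same-language synonyms and variants, so restricting the search to the other language is what surfaces translation pairs.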
