Paper Title
Combining Static and Contextualised Multilingual Embeddings
Paper Authors
Paper Abstract
Static and contextual multilingual embeddings have complementary strengths. Static embeddings, while less expressive than contextual language models, can be more straightforwardly aligned across multiple languages. We combine the strengths of static and contextual models to improve multilingual representations. We extract static embeddings for 40 languages from XLM-R, validate those embeddings with cross-lingual word retrieval, and then align them using VecMap. This results in high-quality, highly multilingual static embeddings. We then apply a novel continued pre-training approach to XLM-R, leveraging the high-quality alignment of our static embeddings to better align the representation space of XLM-R. We show positive results for multiple complex semantic tasks. We release the static embeddings and the continued pre-training code. Unlike most previous work, our continued pre-training approach does not require parallel text.
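To make the first step of the pipeline concrete, below is a minimal sketch (not the authors' exact implementation) of deriving static word embeddings from XLM-R by mean-pooling its contextual representations over a small monolingual corpus, then writing word2vec-format vectors that an alignment tool such as VecMap could consume. The corpus, word segmentation, and output path are placeholders.

```python
# Sketch: static embeddings from XLM-R via mean-pooled contextual vectors.
# Assumes a toy monolingual corpus; a real run would use large corpora per language.
import torch
from collections import defaultdict
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

sentences = ["the cat sat on the mat", "a dog chased the cat"]  # placeholder corpus

sums, counts = defaultdict(lambda: 0.0), defaultdict(int)
with torch.no_grad():
    for sent in sentences:
        words = sent.split()  # naive whitespace tokenisation for illustration
        enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
        hidden = model(**enc).last_hidden_state[0]   # (num_subwords, dim)
        word_ids = enc.word_ids(0)                   # maps each subword to its word index
        for idx, wid in enumerate(word_ids):
            if wid is None:                          # skip special tokens
                continue
            word = words[wid]
            sums[word] = sums[word] + hidden[idx]
            counts[word] += 1

# Average contextual vectors per word -> one static vector per word.
static_emb = {w: sums[w] / counts[w] for w in sums}

# Write word2vec text format so the vectors can be aligned across languages,
# e.g. with VecMap (https://github.com/artetxem/vecmap).
with open("en.static.vec", "w", encoding="utf-8") as f:
    dim = next(iter(static_emb.values())).shape[0]
    f.write(f"{len(static_emb)} {dim}\n")
    for w, vec in static_emb.items():
        f.write(w + " " + " ".join(f"{x:.5f}" for x in vec.tolist()) + "\n")
```

Repeating this per language and then mapping the resulting vector files into a shared space would yield the aligned static embeddings that the continued pre-training step builds on.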