Paper Title
Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages
Paper Authors
Paper Abstract
Scaling multilingual representation learning beyond the hundred most frequent languages is challenging, in particular to cover the long tail of low-resource languages. A promising approach has been to train one-for-all multilingual models capable of cross-lingual transfer, but these models often suffer from insufficient capacity and interference between unrelated languages. Instead, we move away from this approach and focus on training multiple language (family) specific representations, but most prominently enable all languages to still be encoded in the same representational space. To achieve this, we focus on teacher-student training, allowing all encoders to be mutually compatible for bitext mining, and enabling fast learning of new languages. We introduce a new teacher-student training scheme which combines supervised and self-supervised training, allowing encoders to take advantage of monolingual training data, which is valuable in the low-resource setting. Our approach significantly outperforms the original LASER encoder. We study very low-resource languages and handle 50 African languages, many of which are not covered by any other model. For these languages, we train sentence encoders, mine bitexts, and validate the bitexts by training NMT systems.
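The abstract describes a teacher-student scheme that combines a supervised distillation objective with self-supervised masked-language-model (MLM) training on monolingual data. The sketch below is a rough, hypothetical illustration of that idea, not the paper's actual implementation: PyTorch is assumed, and all names (ToyEncoder, training_step, the alpha weighting) are invented. The student's sentence embedding of a low-resource sentence is pulled toward the frozen teacher's embedding of its English translation, while an MLM loss lets the student also learn from monolingual text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Minimal stand-in for a transformer sentence encoder (hypothetical)."""

    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.mlm_head = nn.Linear(dim, vocab_size)  # for the self-supervised loss

    def forward(self, tokens):  # tokens: (batch, seq) -> (batch, seq, dim)
        return self.embed(tokens)

    def encode(self, tokens):  # mean-pooled, fixed-size sentence embedding
        return self.forward(tokens).mean(dim=1)


def training_step(student, teacher, src, eng, masked, labels, alpha=0.5):
    """Combine supervised distillation with a masked-LM loss (hypothetical)."""
    # Supervised: pull the student's embedding of a low-resource sentence
    # toward the frozen teacher's embedding of its English translation.
    with torch.no_grad():
        target = teacher.encode(eng)
    distill_loss = F.mse_loss(student.encode(src), target)

    # Self-supervised: masked-LM loss on monolingual text, so the student
    # can also learn from data that has no translation.
    logits = student.mlm_head(student(masked))
    mlm_loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # unmasked positions carry no label
    )
    return distill_loss + alpha * mlm_loss


# Toy usage with random token ids (batch=2, seq=8).
teacher, student = ToyEncoder(), ToyEncoder()
src = torch.randint(0, 1000, (2, 8))
eng = torch.randint(0, 1000, (2, 8))
masked = torch.randint(0, 1000, (2, 8))
labels = torch.full((2, 8), -100, dtype=torch.long)
labels[:, 3] = masked[:, 3]  # pretend position 3 was masked
loss = training_step(student, teacher, src, eng, masked, labels)
loss.backward()
```

Because the teacher is frozen and only sentence-level targets are needed, each language-specific (or family-specific) student can be trained independently while its embeddings remain comparable to the teacher's shared space, which is what makes the mined bitexts mutually compatible across encoders.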