Paper Title
Multilingual Representation Distillation with Contrastive Learning
Paper Authors
Paper Abstract
Multilingual sentence representations from large models encode semantic information from two or more languages and can be used for different cross-lingual information retrieval and matching tasks. In this paper, we integrate contrastive learning into multilingual representation distillation and use it for quality estimation of parallel sentences (i.e., finding semantically similar sentences that can be used as translations of each other). We validate our approach with multilingual similarity search and corpus filtering tasks. Experiments across different low-resource languages show that our method greatly outperforms previous sentence encoders such as LASER, LASER3, and LaBSE.
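The core idea of contrastive representation distillation — pulling a student encoder's embedding of a sentence toward the teacher's embedding of its translation, while pushing it away from the other sentences in the batch — can be sketched as an InfoNCE-style loss. This is an illustrative sketch only, not the paper's exact formulation; the function name, temperature value, and use of in-batch negatives are assumptions:

```python
import numpy as np

def contrastive_distillation_loss(student_emb, teacher_emb, temperature=0.05):
    """Illustrative InfoNCE-style distillation loss (not the paper's exact loss).

    student_emb: (batch, dim) embeddings of e.g. non-English sentences.
    teacher_emb: (batch, dim) embeddings of their aligned translations.
    Row i of student_emb is treated as a positive pair with row i of
    teacher_emb; all other rows in the batch act as negatives.
    """
    # L2-normalize so dot products are cosine similarities
    s = student_emb / np.linalg.norm(student_emb, axis=1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    logits = (s @ t.T) / temperature  # (batch, batch) similarity matrix

    # Softmax cross-entropy with the diagonal (true pairs) as targets
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Under this formulation, a batch of correctly aligned sentence pairs yields a lower loss than the same batch with its alignment shuffled, which is what makes the resulting similarity scores usable for quality estimation of parallel sentences.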