Paper Title

English Contrastive Learning Can Learn Universal Cross-lingual Sentence Embeddings

Authors

Yau-Shian Wang, Ashley Wu, Graham Neubig

Abstract

Universal cross-lingual sentence embeddings map semantically similar cross-lingual sentences into a shared embedding space. Aligning cross-lingual sentence embeddings usually requires supervised cross-lingual parallel sentences. In this work, we propose mSimCSE, which extends SimCSE to multilingual settings, and reveal that contrastive learning on English data can, surprisingly, learn high-quality universal cross-lingual sentence embeddings without any parallel data. In unsupervised and weakly supervised settings, mSimCSE significantly outperforms previous sentence embedding methods on cross-lingual retrieval and multilingual STS tasks. The performance of unsupervised mSimCSE is comparable to that of fully supervised methods on retrieval for low-resource languages and on multilingual STS. Performance can be further improved when cross-lingual NLI data is available. Our code is publicly available at https://github.com/yaushian/mSimCSE.
