Paper Title
XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation
Paper Authors
Paper Abstract
In this paper, we introduce XGLUE, a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models using multilingual and bilingual corpora, and to evaluate their performance across a diverse set of cross-lingual tasks. Compared to GLUE (Wang et al., 2019), which is labeled in English and covers natural language understanding tasks only, XGLUE has two main advantages: (1) it provides 11 diversified tasks that cover both natural language understanding and generation scenarios; (2) for each task, it provides labeled data in multiple languages. We extend a recent cross-lingual pre-trained model, Unicoder (Huang et al., 2019), to cover both understanding and generation tasks, and evaluate it on XGLUE as a strong baseline. We also evaluate the base versions (12-layer) of Multilingual BERT, XLM, and XLM-R for comparison.
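As a rough illustration of the evaluation setup the abstract describes (fine-tune on English labeled data, then test the same model on other languages), here is a minimal sketch in Python. It assumes XGLUE is mirrored on the Hugging Face Hub under the dataset id "xglue"; the config name "xnli", the split name "validation.de", and the field names "premise"/"hypothesis" are assumptions based on that mirror, not something stated in the paper itself. The checkpoint "xlm-roberta-base" is the standard 12-layer XLM-R model mentioned as a baseline.

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load one XGLUE task; the "xglue"/"xnli" ids assume the Hugging Face mirror.
# Training data is English only; validation/test splits cover more languages.
xnli = load_dataset("xglue", "xnli")

# One of the 12-layer baselines compared in the paper.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=3,  # XNLI labels: entailment / neutral / contradiction
)

# Score one German example zero-shot (no German training data was seen).
example = xnli["validation.de"][0]  # split name is an assumption
batch = tokenizer(example["premise"], example["hypothesis"], return_tensors="pt")
logits = model(**batch).logits
prediction = logits.argmax(dim=-1).item()

In practice the model would first be fine-tuned on the English "train" split before the per-language evaluation loop; the snippet only shows how the multilingual splits make that zero-shot transfer measurement possible.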