Paper Title
X-SRL: A Parallel Cross-Lingual Semantic Role Labeling Dataset
Paper Authors
Paper Abstract
Even though SRL has been researched for many languages, major improvements have mostly been obtained for English, for which more resources are available. In fact, existing multilingual SRL datasets contain disparate annotation styles or come from different domains, hampering generalization in multilingual learning. In this work, we propose a method to automatically construct an SRL corpus that is parallel in four languages: English, French, German, and Spanish, with unified predicate and role annotations that are fully comparable across languages. We apply high-quality machine translation to the English CoNLL-09 dataset and use multilingual BERT to project its high-quality annotations to the target languages. We include human-validated test sets that we use to measure the projection quality, and show that projection is denser and more precise than a strong baseline. Finally, we train different SOTA models on our novel corpus for mono- and multilingual SRL, showing that the multilingual annotations improve performance, especially for the weaker languages.
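The abstract's core step, projecting English role annotations onto machine-translated target sentences via multilingual BERT, can be illustrated with a minimal sketch. The abstract does not specify the alignment procedure, so the following assumes a simple greedy word alignment: each target token is aligned to the source token whose (precomputed, hypothetical) mBERT embedding is most similar under cosine similarity, and the source token's role label is copied over. The toy embeddings and labels below are illustrative only, not the paper's actual data or method.

```python
import numpy as np

def project_labels(src_emb, tgt_emb, src_labels):
    """Copy SRL labels from source to target tokens via greedy
    cosine-similarity alignment of their contextual embeddings.

    src_emb: (n_src, d) array of source-token embeddings
    tgt_emb: (n_tgt, d) array of target-token embeddings
    src_labels: list of n_src role labels (e.g. "A0", "V", "A1")
    """
    # L2-normalize so the dot product equals cosine similarity.
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = t @ s.T                      # (n_tgt, n_src) similarity matrix
    alignment = sim.argmax(axis=1)     # best source token per target token
    return [src_labels[i] for i in alignment]

# Toy example: three source tokens with orthogonal embeddings,
# and a target sentence whose tokens are a permutation of them.
src_emb = np.eye(3)
tgt_emb = np.array([[0.0, 0.0, 1.0],
                    [1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])
src_labels = ["A0", "V", "A1"]
print(project_labels(src_emb, tgt_emb, src_labels))  # ['A1', 'A0', 'V']
```

In practice one would obtain the embeddings from a shared multilingual encoder such as mBERT, so that translation-equivalent words in the two languages land near each other in the same vector space; the projection quality then depends directly on that cross-lingual alignment.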