Paper Title
SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection
Paper Authors
Paper Abstract
A broad goal in natural language processing (NLP) is to develop systems that have the capacity to process any natural language. Most systems, however, are developed using data from just one language, such as English. The SIGMORPHON 2020 shared task on morphological reinflection aims to investigate systems' ability to generalize across typologically distinct languages, many of which are low-resource. Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages. A total of 22 systems (19 neural) from 10 teams were submitted to the task. All four winning systems were neural (two monolingual transformers and two massively multilingual RNN-based models with gated attention). Most teams demonstrated the utility of data hallucination and augmentation, ensembling, and multilingual training for low-resource languages. Non-neural learners and manually designed grammars showed competitive and even superior performance on some languages (such as Ingrian, Tajik, Tagalog, Zarma, Lingala), especially with very limited data. Some language families (Afro-Asiatic, Niger-Congo, Turkic) were relatively easy for most systems, which achieved over 90% mean accuracy on them, while others were more challenging.