Paper title
AfriWOZ: Corpus for Exploiting Cross-Lingual Transferability for Generation of Dialogues in Low-Resource, African Languages
Paper authors
Paper abstract
Dialogue generation is an important NLP task fraught with many challenges. The challenges become more daunting for low-resource African languages. To enable the creation of dialogue agents for African languages, we contribute the first high-quality dialogue datasets for 6 African languages: Swahili, Wolof, Hausa, Nigerian Pidgin English, Kinyarwanda, and Yorùbá. These datasets consist of 1,500 turns each, which we translate from a portion of the English multi-domain MultiWOZ dataset. Subsequently, we investigate and analyze the effectiveness of modelling through transfer learning by utilizing state-of-the-art (SoTA) deep monolingual models: DialoGPT and BlenderBot. We compare the models with a simple seq2seq baseline using perplexity. In addition, we conduct human evaluation of single-turn conversations using majority votes and measure inter-annotator agreement (IAA). Our findings support the hypothesis that deep monolingual models learn some abstractions that generalize across languages. We observe human-like conversations, to different degrees, in 5 out of the 6 languages. The language with the most transferable properties is Nigerian Pidgin English, with a human-likeness score of 78.1%, of which 34.4% are unanimous. We freely provide the datasets and host the model checkpoints/demos on the HuggingFace hub for public access.
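The two evaluation ideas named in the abstract, perplexity and majority-vote human evaluation, can be illustrated with a minimal sketch. This is not the authors' evaluation code. The function names, the label strings, and the toy votes are all hypothetical, and the sketch assumes an odd number of annotators per item, binary labels, and that both the majority and the unanimous shares are computed over all rated items.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """Perplexity as exp of the mean negative log-likelihood (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def majority_vote(labels):
    """Most common label; assumes an odd number of binary votes, so no ties."""
    return Counter(labels).most_common(1)[0][0]


def human_likeness(votes_per_item, positive="human-like"):
    """Return (majority share, unanimous share), both over all rated items.

    Taking both shares over all items is an assumption of this sketch.
    """
    n = len(votes_per_item)
    majority = sum(1 for v in votes_per_item if majority_vote(v) == positive)
    unanimous = sum(1 for v in votes_per_item if all(x == positive for x in v))
    return majority / n, unanimous / n


# Toy, made-up votes from three hypothetical annotators on four responses.
H, N = "human-like", "not human-like"
toy_votes = [
    [H, H, H],  # unanimous human-like
    [H, H, N],  # majority human-like
    [N, N, H],  # majority not human-like
    [H, N, H],  # majority human-like
]
majority_share, unanimous_share = human_likeness(toy_votes)
```

On this toy input, three of the four responses are judged human-like by majority and one of them unanimously. Inter-annotator agreement would be measured separately, with an agreement statistic chosen by the evaluator.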