AfriWOZ: Corpus for Exploiting Cross-Lingual Transferability for Generation of Dialogues in Low-Resource, African Languages

Paper title

AfriWOZ: Corpus for Exploiting Cross-Lingual Transferability for Generation of Dialogues in Low-Resource, African Languages

Authors

Adewumi, Tosin; Adeyemi, Mofetoluwa; Anuoluwapo, Aremu; Peters, Bukola; Buzaaba, Happy; Samuel, Oyerinde; Rufai, Amina Mardiyyah; Ajibade, Benjamin; Gwadabe, Tajudeen; Traore, Mory Moussou Koulibaly; Ajayi, Tunde; Muhammad, Shamsuddeen; Baruwa, Ahmed; Owoicho, Paul; Ogunremi, Tolulope; Ngigi, Phylis; Ahia, Orevaoghene; Nasir, Ruqayya; Liwicki, Foteini; Liwicki, Marcus

Abstract

Dialogue generation is an important NLP task fraught with many challenges. The challenges become more daunting for low-resource African languages. To enable the creation of dialogue agents for African languages, we contribute the first high-quality dialogue datasets for 6 African languages: Swahili, Wolof, Hausa, Nigerian Pidgin English, Kinyarwanda and Yorùbá. These datasets consist of 1,500 turns each, which we translate from a portion of the English multi-domain MultiWOZ dataset. Subsequently, we investigate and analyze the effectiveness of modelling through transfer learning by utilizing state-of-the-art (SoTA) deep monolingual models: DialoGPT and BlenderBot. We compare the models with a simple seq2seq baseline using perplexity. Besides this, we conduct human evaluation of single-turn conversations by using majority votes and measure inter-annotator agreement (IAA). We find that the hypothesis that deep monolingual models learn some abstractions that generalize across languages holds. We observe human-like conversations, to different degrees, in 5 out of the 6 languages. The language with the most transferable properties is Nigerian Pidgin English, with a human-likeness score of 78.1%, of which 34.4% are unanimous. We freely provide the datasets and host the model checkpoints/demos on the HuggingFace hub for public access.
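
The two evaluation quantities named in the abstract, perplexity and a majority-vote human-likeness score, can be illustrated with a minimal Python sketch. This is not the authors' evaluation code. The function names, the label string "human-like", the three-annotator setup, and the toy votes are all hypothetical, and the unanimous share is computed here over all rated items, which is only one possible reading of the abstract.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """Perplexity as exp of the mean negative log-likelihood per token (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def summarize_votes(votes_per_item, positive="human-like"):
    """Return (majority_share, unanimous_share) over all rated items.

    An item counts toward the majority share when more than half of its
    annotators chose `positive`, and toward the unanimous share when all did.
    """
    majority = unanimous = 0
    for votes in votes_per_item:
        count = Counter(votes)[positive]
        if count > len(votes) / 2:
            majority += 1
        if count == len(votes):
            unanimous += 1
    total = len(votes_per_item)
    return majority / total, unanimous / total


# Hypothetical toy votes from three annotators per single-turn conversation.
toy_votes = [
    ["human-like", "human-like", "human-like"],
    ["human-like", "human-like", "not"],
    ["not", "not", "human-like"],
    ["not", "not", "not"],
]
majority_share, unanimous_share = summarize_votes(toy_votes)
print(f"majority: {majority_share:.1%}, unanimous: {unanimous_share:.1%}")

# A model that assigns probability 0.25 to every token has perplexity close to 4.
uniform_ppl = perplexity([math.log(0.25)] * 8)
print(f"perplexity: {uniform_ppl:.2f}")
```

Lower perplexity means the model assigns higher probability to the reference text, which is why it is suited to comparing the pretrained models against the seq2seq baseline on the same data.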
