Paper Title
Improving Yorùbá Diacritic Restoration
Paper Authors
Paper Abstract
Yorùbá is a widely spoken West African language with a writing system rich in orthographic and tonal diacritics. These diacritics provide morphological information, are crucial for lexical disambiguation and pronunciation, and are vital for any computational speech or natural language processing task. However, diacritic marks are commonly excluded from electronic texts because of limited device and application support and limited general education on proper usage. We report on recent efforts at dataset cultivation. By aggregating and improving disparate texts from the web and various personal libraries, we were able to significantly grow our clean Yorùbá dataset from a majority-Biblical text corpus with three sources to millions of tokens from over a dozen sources. We evaluate updated diacritic restoration models on a new, general-purpose, public-domain Yorùbá evaluation dataset of modern journalistic news text, selected to be multi-purpose and to reflect contemporary usage. All pre-trained models, datasets, and source code have been released as an open-source project to advance efforts on Yorùbá language technology.
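To make the lexical ambiguity concrete, the following minimal Python sketch shows how differently marked spellings collapse to one undiacritized form once the marks are dropped, which is the information a restoration model has to recover. The function name `strip_diacritics` and the sample words are illustrative assumptions and are not taken from the released project; only the standard-library `unicodedata` module is used.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove all combining marks (tone marks and under-dots) from text.

    Hypothetical helper for illustration; not the project's own code.
    """
    # NFD splits each precomposed character into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters whose canonical combining class is 0 (non-marks).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


# Three differently marked spellings that share the same base letters.
marked_forms = ["igba", "igbá", "ìgbà"]
undiacritized = {strip_diacritics(word) for word in marked_forms}
print(undiacritized)
```

Applying such a function to clean, fully marked text is one plausible way to build (undiacritized, diacritized) sentence pairs for training; whether the released models were trained this way is an assumption here, not something the abstract states.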