Paper Title
KinyaBERT: a Morphology-aware Kinyarwanda Language Model
Paper Authors
Paper Abstract
Pre-trained language models such as BERT have been successful at tackling many natural language processing tasks. However, the unsupervised sub-word tokenization methods commonly used in these models (e.g., byte-pair encoding, BPE) are sub-optimal at handling morphologically rich languages. Even given a morphological analyzer, naive sequencing of morphemes into a standard BERT architecture is inefficient at capturing morphological compositionality and expressing word-relative syntactic regularities. We address these challenges by proposing a simple yet effective two-tier BERT architecture that leverages a morphological analyzer and explicitly represents morphological compositionality. Despite the success of BERT, most of its evaluations have been conducted on high-resource languages, obscuring its applicability to low-resource languages. We evaluate our proposed method on the low-resource morphologically rich Kinyarwanda language, naming the proposed model architecture KinyaBERT. A robust set of experimental results reveals that KinyaBERT outperforms solid baselines by 2% in F1 score on a named entity recognition task and by 4.3% in average score of a machine-translated GLUE benchmark. KinyaBERT fine-tuning has better convergence and achieves more robust results on multiple tasks even in the presence of translation noise.
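Since the abstract only sketches the two-tier design, the following is a minimal, hypothetical PyTorch sketch of the general idea it describes: a small morpheme-level encoder pools each word's morphemes into a single word vector, which a larger sentence-level encoder then contextualizes. The class name `TwoTierEncoder`, the layer sizes, and the mean-pooling step are illustrative assumptions, not the authors' actual KinyaBERT implementation.

```python
# Illustrative sketch (not the authors' code) of a two-tier encoder:
# a small morpheme-level transformer pools each word's morphemes into one
# vector, which a sentence-level transformer then contextualizes.
import torch
import torch.nn as nn

class TwoTierEncoder(nn.Module):
    def __init__(self, morph_vocab=20000, morph_dim=128, sent_dim=768,
                 morph_layers=2, sent_layers=12, sent_heads=8):
        super().__init__()
        self.morph_emb = nn.Embedding(morph_vocab, morph_dim)
        morph_layer = nn.TransformerEncoderLayer(morph_dim, 4, batch_first=True)
        self.morph_encoder = nn.TransformerEncoder(morph_layer, morph_layers)
        self.proj = nn.Linear(morph_dim, sent_dim)  # word vector -> sentence-tier width
        sent_layer = nn.TransformerEncoderLayer(sent_dim, sent_heads, batch_first=True)
        self.sent_encoder = nn.TransformerEncoder(sent_layer, sent_layers)

    def forward(self, morpheme_ids):
        # morpheme_ids: (batch, num_words, morphemes_per_word) integer ids
        b, w, m = morpheme_ids.shape
        x = self.morph_emb(morpheme_ids.view(b * w, m))   # embed morphemes per word
        x = self.morph_encoder(x).mean(dim=1)             # pool morphemes into a word vector
        words = self.proj(x).view(b, w, -1)               # (batch, words, sent_dim)
        return self.sent_encoder(words)                   # contextualized word representations

# Toy usage: 2 sentences, 5 words each, up to 4 morphemes per word.
ids = torch.randint(0, 20000, (2, 5, 4))
out = TwoTierEncoder()(ids)
print(out.shape)  # torch.Size([2, 5, 768])
```

The design choice this sketch illustrates is that morphological composition happens inside the lower tier, so the upper tier operates over word-level units rather than a flat morpheme sequence.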