Paper Title
KinyaBERT: a Morphology-aware Kinyarwanda Language Model
Paper Authors
Paper Abstract
Pre-trained language models such as BERT have been successful at tackling many natural language processing tasks. However, the unsupervised sub-word tokenization methods commonly used in these models (e.g., byte-pair encoding, BPE) are sub-optimal at handling morphologically rich languages. Even given a morphological analyzer, naive sequencing of morphemes into a standard BERT architecture is inefficient at capturing morphological compositionality and expressing word-relative syntactic regularities. We address these challenges by proposing a simple yet effective two-tier BERT architecture that leverages a morphological analyzer and explicitly represents morphological compositionality. Despite the success of BERT, most of its evaluations have been conducted on high-resource languages, obscuring its applicability to low-resource languages. We evaluate our proposed method on the low-resource morphologically rich Kinyarwanda language, naming the proposed model architecture KinyaBERT. A robust set of experimental results reveals that KinyaBERT outperforms solid baselines by 2% in F1 score on a named entity recognition task and by 4.3% in average score of a machine-translated GLUE benchmark. KinyaBERT fine-tuning has better convergence and achieves more robust results on multiple tasks even in the presence of translation noise.
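Since the abstract only sketches the two-tier design, the following is a minimal, hypothetical PyTorch sketch of the general idea it describes: a small morpheme-level encoder pools each word's morphemes into a single word vector, which a larger sentence-level encoder then contextualizes. The class name `TwoTierEncoder`, the layer sizes, and the mean-pooling step are illustrative assumptions, not the authors' actual KinyaBERT implementation.

```python
# Illustrative sketch (not the authors' code) of a two-tier encoder:
# a small morpheme-level transformer pools each word's morphemes into one
# vector, which a sentence-level transformer then contextualizes.
import torch
import torch.nn as nn

class TwoTierEncoder(nn.Module):
    def __init__(self, morph_vocab=20000, morph_dim=128, sent_dim=768,
                 morph_layers=2, sent_layers=12, sent_heads=8):
        super().__init__()
        self.morph_emb = nn.Embedding(morph_vocab, morph_dim)
        morph_layer = nn.TransformerEncoderLayer(morph_dim, 4, batch_first=True)
        self.morph_encoder = nn.TransformerEncoder(morph_layer, morph_layers)
        self.proj = nn.Linear(morph_dim, sent_dim)  # word vector -> sentence-tier width
        sent_layer = nn.TransformerEncoderLayer(sent_dim, sent_heads, batch_first=True)
        self.sent_encoder = nn.TransformerEncoder(sent_layer, sent_layers)

    def forward(self, morpheme_ids):
        # morpheme_ids: (batch, num_words, morphemes_per_word) integer ids
        b, w, m = morpheme_ids.shape
        x = self.morph_emb(morpheme_ids.view(b * w, m))   # embed morphemes per word
        x = self.morph_encoder(x).mean(dim=1)             # pool morphemes into a word vector
        words = self.proj(x).view(b, w, -1)               # (batch, words, sent_dim)
        return self.sent_encoder(words)                   # contextualized word representations

# Toy usage: 2 sentences, 5 words each, up to 4 morphemes per word.
ids = torch.randint(0, 20000, (2, 5, 4))
out = TwoTierEncoder()(ids)
print(out.shape)  # torch.Size([2, 5, 768])
```

The design choice this sketch illustrates is that morphological composition happens inside the lower tier, so the upper tier operates over word-level units rather than a flat morpheme sequence.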