BERTifying Sinhala -- A Comprehensive Analysis of Pre-trained Language Models for Sinhala Text Classification

Paper title

BERTifying Sinhala -- A Comprehensive Analysis of Pre-trained Language Models for Sinhala Text Classification

Authors

Vinura Dhananjaya, Piyumal Demotte, Surangika Ranathunga, Sanath Jayasena

Abstract

This research provides the first comprehensive analysis of the performance of pre-trained language models for Sinhala text classification. We test on a set of different Sinhala text classification tasks and our analysis shows that out of the pre-trained multilingual models that include Sinhala (XLM-R, LaBSE, and LASER), XLM-R is the best model by far for Sinhala text classification. We also pre-train two RoBERTa-based monolingual Sinhala models, which are far superior to the existing pre-trained language models for Sinhala. We show that when fine-tuned, these pre-trained language models set a very strong baseline for Sinhala text classification and are robust in situations where labeled data is insufficient for fine-tuning. We further provide a set of recommendations for using pre-trained models for Sinhala text classification. We also introduce new annotated datasets useful for future research in Sinhala text classification and publicly release our pre-trained models.

Paper title