Paper Title

RobBERT: a Dutch RoBERTa-based Language Model

Authors

Pieter Delobelle, Thomas Winters, Bettina Berendt

Abstract

Pre-trained language models have been dominating the field of natural language processing in recent years, and have led to significant performance gains for various complex natural language tasks. One of the most prominent pre-trained language models is BERT, which was released as an English as well as a multilingual version. Although multilingual BERT performs well on many tasks, recent studies show that BERT models trained on a single language significantly outperform the multilingual version. Training a Dutch BERT model thus has a lot of potential for a wide range of Dutch NLP tasks. While previous approaches have used earlier implementations of BERT to train a Dutch version of BERT, we used RoBERTa, a robustly optimized BERT approach, to train a Dutch language model called RobBERT. We measured its performance on various tasks as well as the importance of the fine-tuning dataset size. We also evaluated the importance of language-specific tokenizers and the model's fairness. We found that RobBERT improves state-of-the-art results for various tasks, and especially significantly outperforms other models when dealing with smaller datasets. These results indicate that it is a powerful pre-trained model for a large variety of Dutch language tasks. The pre-trained and fine-tuned models are publicly available to support further downstream Dutch NLP applications.
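Since the abstract notes that the pre-trained model is publicly released, the following is a minimal sketch of how it might be loaded with the Hugging Face Transformers library; the checkpoint identifier "pdelobelle/robbert-v2-dutch-base" and the example sentence are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch (not from the paper): loading a publicly released RobBERT
# checkpoint via Hugging Face Transformers and filling in a masked Dutch word.
# The model identifier below is an assumption; check the authors' release page.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "pdelobelle/robbert-v2-dutch-base"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Simple masked-language-model sanity check on a Dutch sentence.
text = f"Er staat een {tokenizer.mask_token} in mijn tuin."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Inspect the top predictions for the masked position.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids))
```

Fine-tuning for the downstream tasks mentioned in the abstract (e.g. sentiment analysis or die/dat disambiguation) would start from the same checkpoint with a task-specific head, but the exact training setup is described in the paper itself.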
