Paper Title

AraGPT2: Pre-Trained Transformer for Arabic Language Generation

Paper Authors

Wissam Antoun, Fady Baly, Hazem Hajj

Paper Abstract

Recently, pre-trained transformer-based architectures have proven to be very efficient at language modeling and understanding, given that they are trained on a large enough corpus. Applications in language generation for Arabic are still lagging behind other NLP advances, primarily due to the lack of advanced Arabic language generation models. In this paper, we develop the first advanced Arabic language generation model, AraGPT2, trained from scratch on a large Arabic corpus of internet text and news articles. Our largest model, AraGPT2-mega, has 1.46 billion parameters, which makes it the largest Arabic language model available. The mega model was evaluated and showed success on different tasks, including synthetic news generation and zero-shot question answering. For text generation, our best model achieves a perplexity of 29.8 on held-out Wikipedia articles. A study conducted with human evaluators showed the significant success of AraGPT2-mega in generating news articles that are difficult to distinguish from articles written by humans. We therefore develop and release an automatic discriminator model with 98% accuracy in detecting model-generated text. The models are also publicly available, in the hope of encouraging new research directions and applications for Arabic NLP.
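
Example Code

For context, the reported perplexity is the exponential of the average negative log-likelihood the model assigns to held-out tokens, so 29.8 means the model is, on average, about as uncertain as a uniform choice among roughly 30 tokens at each step. Since the abstract states the models are publicly released, below is a minimal sketch of loading an AraGPT2 checkpoint with the Hugging Face transformers library to generate text and to compute perplexity on a sample. The checkpoint id aubmindlab/aragpt2-base and the sampling settings are assumptions for illustration, not the paper's exact configuration.

# Minimal sketch: generate Arabic text with AraGPT2 and compute perplexity.
# Assumption: the checkpoint id "aubmindlab/aragpt2-base" matches the authors'
# public release; larger variants (e.g. mega) may require a custom model class.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "aubmindlab/aragpt2-base"  # assumed Hugging Face checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Text generation from a short Arabic prompt (sampling settings are illustrative).
prompt = "يعتبر الذكاء الاصطناعي"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=80,
    do_sample=True,
    top_p=0.95,
    temperature=0.9,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# Perplexity of the model on a sample:
# PPL = exp(mean negative log-likelihood over the tokens).
with torch.no_grad():
    out = model(**inputs, labels=inputs["input_ids"])
print("perplexity:", torch.exp(out.loss).item())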
