Paper Title

Indic-Transformers: An Analysis of Transformer Language Models for Indian Languages

Paper Authors

Kushal Jain, Adwait Deshpande, Kumar Shridhar, Felix Laumann, Ayushman Dash

Paper Abstract

Language models based on the Transformer architecture have achieved state-of-the-art performance on a wide range of NLP tasks such as text classification, question answering, and token classification. However, this performance is usually tested and reported on high-resource languages such as English, French, Spanish, and German. Indian languages, on the other hand, are underrepresented in such benchmarks. Although some Indian languages are included in the training of multilingual Transformer models, they have not been the primary focus of such work. To evaluate performance on Indian languages specifically, we analyze these language models through extensive experiments on multiple downstream tasks in Hindi, Bengali, and Telugu. Here, we compare the efficacy of fine-tuning the parameters of pre-trained models against training a language model from scratch. Moreover, we empirically argue against a strict dependency between dataset size and model performance, and instead encourage task-specific model and method selection. We achieve state-of-the-art performance for the text classification task in Hindi and Bengali. Finally, we present effective strategies for modeling Indian languages and release our model checkpoints for the community: https://huggingface.co/neuralspace-reverie.
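
The checkpoints released with the paper are hosted on the Hugging Face Hub (https://huggingface.co/neuralspace-reverie). The snippet below is a minimal sketch of how such a checkpoint could be loaded with the transformers library and prepared for a downstream text classification task; the model identifier "neuralspace-reverie/indic-transformers-hi-bert" is an assumed example and should be checked against the actual repository listing.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical checkpoint identifier under the released namespace; verify the
# concrete model names at https://huggingface.co/neuralspace-reverie.
model_name = "neuralspace-reverie/indic-transformers-hi-bert"

# Load the tokenizer and the pre-trained encoder with a (freshly initialized)
# classification head, e.g. for a binary Hindi text classification task.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize a Hindi sentence and run a forward pass.
inputs = tokenizer("यह एक उदाहरण वाक्य है।", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch_size, num_labels)

From this point the model would be fine-tuned on task-specific labeled data, which corresponds to the fine-tuning setting compared against training from scratch in the paper.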
