Paper Title
MuCoT: Multilingual Contrastive Training for Question-Answering in Low-resource Languages
Paper Authors
Paper Abstract
Accuracy of English-language Question Answering (QA) systems has improved significantly in recent years with the advent of Transformer-based models (e.g., BERT). These models are pre-trained in a self-supervised fashion with a large English text corpus and further fine-tuned with a massive English QA dataset (e.g., SQuAD). However, QA datasets on such a scale are not available for most other languages. Multi-lingual BERT-based models (mBERT) are often used to transfer knowledge from high-resource languages to low-resource languages. Since these models are pre-trained with huge text corpora containing multiple languages, they typically learn language-agnostic embeddings for tokens from different languages. However, directly training an mBERT-based QA system for low-resource languages is challenging due to the paucity of training data. In this work, we augment the QA samples of the target language using translation and transliteration into other languages and use the augmented data to fine-tune an mBERT-based QA model, which is already pre-trained in English. Experiments on the Google ChAII dataset show that fine-tuning the mBERT model with translations from the same language family boosts the question-answering performance, whereas the performance degrades in the case of cross-language families. We further show that introducing a contrastive loss between the translated question-context feature pairs during the fine-tuning process prevents such degradation with cross-lingual family translations and leads to marginal improvement. The code for this work is available at https://github.com/gokulkarthik/mucot.
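The fine-tuning objective described in the abstract combines the standard extractive-QA span loss with a contrastive term that pulls together the features of a question-context pair and its translation. The sketch below is a hypothetical illustration of that idea, not the authors' released implementation (see the linked repository for that): the mean pooling, temperature, equal loss weighting, and the helper names `pooled_features` and `contrastive_qa_loss` are all assumptions.

```python
# Minimal sketch, assuming an InfoNCE-style contrastive term over pooled mBERT
# features of original vs. translated question-context pairs, added to the
# usual extractive-QA span loss. Pooling choice, temperature, and loss
# weighting are assumptions, not the paper's exact configuration.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-multilingual-cased")


def pooled_features(question, context):
    """Encode a question-context pair and mean-pool the last hidden state."""
    inputs = tokenizer(question, context, return_tensors="pt",
                       truncation=True, max_length=384)
    outputs = model(**inputs, output_hidden_states=True)
    hidden = outputs.hidden_states[-1]           # (1, seq_len, hidden_dim)
    return hidden.mean(dim=1), outputs           # pooled: (1, hidden_dim)


def contrastive_qa_loss(batch_pairs, start_positions, end_positions,
                        temperature=0.1):
    """Span loss on original-language samples plus a contrastive loss that
    treats each sample and its translation as a positive pair.

    batch_pairs: list of (question, context, translated_question,
                 translated_context) tuples.
    start_positions / end_positions: answer token indices in the tokenized
                 original-language input (hypothetical preprocessing step).
    """
    feats_orig, feats_trans = [], []
    span_loss = 0.0
    for (q, c, q_t, c_t), start, end in zip(batch_pairs,
                                            start_positions, end_positions):
        f_o, out = pooled_features(q, c)
        f_t, _ = pooled_features(q_t, c_t)
        feats_orig.append(f_o)
        feats_trans.append(f_t)
        # Standard extractive-QA loss: cross-entropy over start/end logits.
        span_loss = span_loss + (
            F.cross_entropy(out.start_logits, torch.tensor([start])) +
            F.cross_entropy(out.end_logits, torch.tensor([end]))
        ) / 2

    # Contrastive term: matching original/translated pairs are positives,
    # all other pairs in the batch act as negatives.
    z_o = F.normalize(torch.cat(feats_orig), dim=-1)   # (B, hidden_dim)
    z_t = F.normalize(torch.cat(feats_trans), dim=-1)
    logits = z_o @ z_t.T / temperature                 # cosine similarities
    targets = torch.arange(z_o.size(0))
    contrastive = F.cross_entropy(logits, targets)

    return span_loss / len(batch_pairs) + contrastive
```

Under this reading, the contrastive term encourages mBERT to map a question-context pair and its translation to nearby points in feature space, which is one way the degradation from cross-language-family translations described in the abstract could be mitigated.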