Training Question Answering Models From Synthetic Data

Paper Title

Training Question Answering Models From Synthetic Data

Authors

Raul Puri, Ryan Spring, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

Abstract

Question and answer generation is a data augmentation method that aims to improve question answering (QA) models given the limited amount of human labeled data. However, a considerable gap remains between synthetic and human-generated question-answer pairs. This work aims to narrow this gap by taking advantage of large language models and explores several factors such as model size, quality of pretrained models, scale of data synthesized, and algorithmic choices. On the SQuAD1.1 question answering task, we achieve higher accuracy using solely synthetic questions and answers than when using the SQuAD1.1 training set questions alone. Removing access to real Wikipedia data, we synthesize questions and answers from a synthetic corpus generated by an 8.3 billion parameter GPT-2 model. With no access to human supervision and only access to other models, we are able to train state of the art question answering networks on entirely model-generated data that achieve 88.4 Exact Match (EM) and 93.9 F1 score on the SQuAD1.1 dev set. We further apply our methodology to SQuAD2.0 and show a 2.8 absolute gain on EM score compared to prior work using synthetic data.
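The abstract reports results as Exact Match (EM) and F1 without defining them. The sketch below shows how these two scores are conventionally computed for a single SQuAD-style prediction: answers are normalized (lowercased, punctuation and English articles removed), EM checks string equality, and F1 measures token overlap. It is an illustration of the standard metric definitions, not the authors' evaluation code; the function names and the example strings are assumptions made for this sketch.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, drop English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def score(prediction: str, references: list[str]) -> tuple[float, float]:
    """Take the best EM and F1 over all reference answers for one question."""
    em = max(exact_match(prediction, ref) for ref in references)
    f1 = max(f1_score(prediction, ref) for ref in references)
    return em, f1


if __name__ == "__main__":
    # Hypothetical prediction and references, for illustration only.
    print(score("in Paris, France", ["Paris", "Paris, France"]))
```

Dev-set figures such as those quoted in the abstract are these per-question scores averaged over all questions and expressed as percentages.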
