Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation

Paper Title

Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation

Authors

Pei Ke, Haozhe Ji, Zhenyu Yang, Yi Huang, Junlan Feng, Xiaoyan Zhu, Minlie Huang

Abstract

Despite the success of text-to-text pre-trained models in various natural language generation (NLG) tasks, generation performance is largely restricted by the amount of labeled data in downstream tasks, particularly in data-to-text generation tasks. Existing works mostly utilize abundant unlabeled structured data to conduct unsupervised pre-training for task adaptation, which fails to model the complex relationship between source structured data and target texts. Thus, we introduce self-training as a better few-shot learner than task-adaptive pre-training, which explicitly captures this relationship via pseudo-labeled data generated by the pre-trained model. To alleviate the side effect of low-quality pseudo-labeled data during self-training, we propose a novel method called Curriculum-Based Self-Training (CBST) to effectively leverage unlabeled data in a rearranged order determined by the difficulty of text generation. Experimental results show that our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
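
The general idea in the abstract (pseudo-label unlabeled inputs with the current model, and feed them in from easy to hard) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The difficulty proxy (number of input triples), the stage count, and the `train` callable are all assumptions made for illustration; the abstract only says that the order is determined by the difficulty of text generation.

```python
from typing import Callable, List, Sequence, Tuple

Triple = Tuple[str, str, str]
Record = Tuple[Triple, ...]          # structured input: a tuple of triples
Pair = Tuple[Record, str]            # (structured input, target text)
Model = Callable[[Record], str]      # maps a record to generated text


def difficulty(record: Record) -> int:
    """Hypothetical difficulty proxy: more triples means harder to verbalize."""
    return len(record)


def split_into_stages(unlabeled: Sequence[Record], num_stages: int) -> List[List[Record]]:
    """Sort unlabeled records from easy to hard and cut them into stages."""
    ordered = sorted(unlabeled, key=difficulty)
    if not ordered:
        return []
    size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def curriculum_self_training(
    labeled: Sequence[Pair],
    unlabeled: Sequence[Record],
    train: Callable[[List[Pair]], Model],  # assumed: fits a model on pairs
    num_stages: int = 3,
) -> Model:
    """Self-training loop that admits harder unlabeled data stage by stage."""
    model = train(list(labeled))           # start from the few-shot labeled set
    pool: List[Record] = []
    for stage in split_into_stages(unlabeled, num_stages):
        pool.extend(stage)                 # add the next-harder records
        pseudo = [(rec, model(rec)) for rec in pool]  # pseudo-label with current model
        model = train(list(labeled) + pseudo)         # retrain on labeled + pseudo pairs
    return model
```

In practice `train` would fine-tune a text-to-text pre-trained model, and pseudo-labeled pairs would typically be filtered or noised before retraining; both steps are omitted here to keep the loop readable.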
