Paper Title

PromDA: Prompt-based Data Augmentation for Low-Resource NLU Tasks

Authors

Yufei Wang, Can Xu, Qingfeng Sun, Huang Hu, Chongyang Tao, Xiubo Geng, Daxin Jiang

Abstract

This paper focuses on Data Augmentation for low-resource Natural Language Understanding (NLU) tasks. We propose the Prompt-based Data Augmentation model (PromDA), which only trains a small-scale Soft Prompt (i.e., a set of trainable vectors) in frozen Pre-trained Language Models (PLMs). This avoids human effort in collecting unlabeled in-domain data and maintains the quality of the generated synthetic data. In addition, PromDA generates synthetic data via two different views and filters out low-quality data using NLU models. Experiments on four benchmarks show that synthetic data produced by PromDA successfully boosts the performance of NLU models, which consistently outperform several competitive baseline models, including a state-of-the-art semi-supervised model using unlabeled in-domain data. The synthetic data from PromDA are also complementary to unlabeled in-domain data; NLU models can be further improved when the two are combined for training.
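
The key mechanism in the abstract (training only a small set of prompt vectors while the backbone PLM stays frozen) can be illustrated with a minimal sketch. This is not the authors' released implementation; the class name SoftPromptModel, the T5 backbone, and the n_prompt_tokens parameter are illustrative assumptions.

```python
# Minimal sketch of Soft Prompt tuning on a frozen PLM (assumed T5 backbone).
# Not the PromDA codebase; names and sizes are illustrative only.
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration

class SoftPromptModel(nn.Module):
    """Prepends trainable prompt vectors to a frozen PLM's input embeddings."""
    def __init__(self, plm_name="t5-base", n_prompt_tokens=100):
        super().__init__()
        self.plm = T5ForConditionalGeneration.from_pretrained(plm_name)
        for p in self.plm.parameters():
            p.requires_grad = False  # freeze the PLM; only the prompt trains
        d_model = self.plm.config.d_model
        # The Soft Prompt: a small matrix of trainable vectors.
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, d_model) * 0.02)

    def forward(self, input_ids, attention_mask, labels=None):
        tok_emb = self.plm.get_input_embeddings()(input_ids)   # (B, L, d)
        bsz = tok_emb.size(0)
        prompt = self.prompt.unsqueeze(0).expand(bsz, -1, -1)  # (B, P, d)
        inputs_embeds = torch.cat([prompt, tok_emb], dim=1)    # (B, P+L, d)
        prompt_mask = attention_mask.new_ones(bsz, self.prompt.size(0))
        full_mask = torch.cat([prompt_mask, attention_mask], dim=1)
        return self.plm(inputs_embeds=inputs_embeds,
                        attention_mask=full_mask, labels=labels)
```

In this sketch only self.prompt receives gradients, so an optimizer built over filter(lambda p: p.requires_grad, model.parameters()) updates just the prompt vectors, matching the abstract's claim that the method trains only a small-scale Soft Prompt.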
