DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks

Paper Title

DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks

Authors

Bosheng Ding, Linlin Liu, Lidong Bing, Canasai Kruengkrai, Thien Hai Nguyen, Shafiq Joty, Luo Si, Chunyan Miao

Abstract

Data augmentation techniques have been widely used to improve machine learning performance because they enhance the generalization capability of models. In this work, to generate high-quality synthetic data for low-resource tagging tasks, we propose a novel augmentation method with language models trained on linearized labeled sentences. Our method is applicable to both supervised and semi-supervised settings. For the supervised setting, we conduct extensive experiments on named entity recognition (NER), part-of-speech (POS) tagging, and end-to-end target-based sentiment analysis (E2E-TBSA) tasks. For the semi-supervised setting, we evaluate our method on the NER task under two conditions: unlabeled data only, and unlabeled data plus a knowledge base. The results show that our method consistently outperforms the baselines, particularly when little gold training data is available.
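
The abstract mentions "linearized labeled sentences" without spelling out the format. As a rough illustration only, the sketch below assumes one plausible scheme: each non-O tag is placed in front of its word so that a tagged sentence becomes a single token sequence a language model can be trained on, and generated sequences can be mapped back to (token, tag) pairs. The function names `linearize` and `delinearize`, the example sentence, and the tag set are hypothetical and are not taken from this page.

```python
from typing import List, Set, Tuple


def linearize(tokens: List[str], tags: List[str], outside: str = "O") -> List[str]:
    """Interleave tags and words: every non-outside tag precedes its word.

    Assumption: words tagged with the outside label carry no explicit tag.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    sequence: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag != outside:
            sequence.append(tag)
        sequence.append(token)
    return sequence


def delinearize(
    sequence: List[str], tag_set: Set[str], outside: str = "O"
) -> Tuple[List[str], List[str]]:
    """Recover (tokens, tags) from a linearized sequence.

    A tag applies to the next word only. If several tags appear in a row,
    the last one wins; a trailing tag with no following word is dropped.
    """
    tokens: List[str] = []
    tags: List[str] = []
    pending = None
    for item in sequence:
        if item in tag_set:
            pending = item
        else:
            tokens.append(item)
            tags.append(pending if pending is not None else outside)
            pending = None
    return tokens, tags


if __name__ == "__main__":
    words = ["Jose", "Valentin", "lives", "in", "London"]
    labels = ["B-PER", "E-PER", "O", "O", "S-LOC"]
    linear = linearize(words, labels)
    print(" ".join(linear))
    print(delinearize(linear, {"B-PER", "E-PER", "S-LOC"}))
```

Under this assumed scheme, synthetic training data would be obtained by sampling sequences from a language model trained on the linearized text and passing them through `delinearize`; the round trip is lossless only when no word in the vocabulary collides with a tag string.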
