Paper Title

C2C-GenDA: Cluster-to-Cluster Generation for Data Augmentation of Slot Filling

Authors

Yutai Hou, Sanyuan Chen, Wanxiang Che, Cheng Chen, Ting Liu

Abstract

Slot filling, a fundamental module of spoken language understanding, often suffers from insufficient quantity and diversity of training data. To remedy this, we propose a novel Cluster-to-Cluster generation framework for Data Augmentation (DA), named C2C-GenDA. It enlarges the training set by reconstructing existing utterances into alternative expressions while preserving their semantics. Unlike previous DA work, which reconstructs utterances one by one and independently, C2C-GenDA jointly encodes multiple existing utterances with the same semantics and simultaneously decodes multiple unseen expressions. Generating multiple new utterances jointly makes it possible to model the relations between generated instances and encourages diversity. In addition, encoding multiple existing utterances gives C2C a wider view of existing expressions, which helps reduce generations that duplicate existing data. Experiments on the ATIS and Snips datasets show that instances augmented by C2C-GenDA improve slot filling by 7.99 (11.9%) and 5.76 (13.6%) F-scores, respectively, when only hundreds of training utterances are available.
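To make the cluster-to-cluster setup more concrete, the following minimal sketch shows one way training pairs for such a model could be assembled. It is not the authors' implementation. It assumes that "same semantics" means the same set of slot names, and that each cluster is split into an input cluster and a held-out target cluster. The names `group_by_semantics`, `make_c2c_pairs`, and `n_target`, as well as the example utterances, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumption: an utterance is (delexicalized text, slot names it contains).
Utterance = Tuple[str, Tuple[str, ...]]
SlotKey = Tuple[str, ...]


def group_by_semantics(utterances: List[Utterance]) -> Dict[SlotKey, List[str]]:
    """Group utterances whose slot-name sets are identical (assumed notion of 'same semantics')."""
    clusters: Dict[SlotKey, List[str]] = defaultdict(list)
    for text, slots in utterances:
        clusters[tuple(sorted(slots))].append(text)
    return dict(clusters)


def make_c2c_pairs(
    clusters: Dict[SlotKey, List[str]], n_target: int = 2
) -> List[Tuple[List[str], List[str]]]:
    """Split each cluster into (input cluster, target cluster).

    The last n_target utterances are held out as the expressions to generate;
    clusters too small to leave a non-empty input cluster are skipped.
    """
    pairs = []
    for texts in clusters.values():
        if len(texts) <= n_target:
            continue
        pairs.append((texts[:-n_target], texts[-n_target:]))
    return pairs


# Hypothetical delexicalized utterances in the style of a flight-booking corpus.
data: List[Utterance] = [
    ("show flights from <from_city> to <to_city>", ("from_city", "to_city")),
    ("i need to fly to <to_city> from <from_city>", ("to_city", "from_city")),
    ("book me a trip from <from_city> to <to_city>", ("from_city", "to_city")),
    ("what goes from <from_city> to <to_city>", ("from_city", "to_city")),
    ("play <song>", ("song",)),
]

clusters = group_by_semantics(data)
pairs = make_c2c_pairs(clusters, n_target=2)
for source_cluster, target_cluster in pairs:
    print("input cluster: ", source_cluster)
    print("target cluster:", target_cluster)
```

In this sketch a sequence-to-sequence model would read the whole input cluster at once and be trained to emit every utterance in the target cluster, in contrast to the one-utterance-in, one-utterance-out pairing that the abstract attributes to earlier DA work.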
