Paper Title
Reprint: a randomized extrapolation based on principal components for data augmentation
Paper Authors
Abstract
Data scarcity and data imbalance have attracted considerable attention in many fields. Data augmentation, an effective approach to tackling both, can improve the robustness and efficiency of classification models by generating new samples. This paper presents REPRINT, a simple and effective hidden-space data augmentation method for imbalanced data classification. Given hidden-space representations of samples in each class, REPRINT extrapolates, in a randomized fashion, augmented examples for the target class by using subspaces spanned by principal components to summarize the distribution structure of both the source and target classes. Consequently, the generated examples diversify the target class while preserving the original geometry of the target distribution. In addition, the method includes a label refinement component that synthesizes new soft labels for the augmented examples. Compared with different NLP data augmentation approaches under a range of data-imbalance scenarios on four text classification benchmarks, REPRINT shows prominent improvements. Moreover, through comprehensive ablation studies, we show that label refinement outperforms label preservation for augmented examples, and that our method yields stable and consistent improvements across suitable choices of principal components. Finally, REPRINT is appealing for its ease of use, since it contains only one hyperparameter, which determines the subspace dimension, and requires few computational resources.
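The core idea described above can be sketched as follows. This is a minimal illustration of randomized extrapolation in a principal-component subspace, not the authors' exact algorithm: the function name `reprint_augment` and the parameters `k` (subspace dimension, the method's single hyperparameter) and `scale` are hypothetical, and the label refinement component is omitted.

```python
import numpy as np

def reprint_augment(target_feats, source_feats, k=3, scale=0.5, rng=None):
    """Hypothetical sketch: extrapolate one new hidden-space example for the
    target class along the top-k principal components of the source class,
    anchored at a randomly chosen target-class sample."""
    rng = np.random.default_rng(rng)
    # Principal components of the source class via SVD of centered features
    centered = source_feats - source_feats.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]                      # (k, d) orthonormal subspace basis
    # Random target-class anchor plus a random direction inside the subspace
    anchor = target_feats[rng.integers(len(target_feats))]
    coeffs = rng.standard_normal(k) * scale  # randomized extrapolation weights
    return anchor + coeffs @ components      # augmented hidden-space vector
```

Because the perturbation lives in a low-dimensional subspace that summarizes the class distribution structure, the augmented point diversifies the target class without distorting its overall geometry.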