Paper Title

To SMOTE, or not to SMOTE?

Paper Authors

Yotam Elor and Hadar Averbuch-Elor

Paper Abstract

Balancing the data before training a classifier is a popular technique to address the challenges of imbalanced binary classification in tabular data. Balancing is commonly achieved by duplication of minority samples or by generation of synthetic minority samples. While it is well known that balancing affects each classifier differently, most prior empirical studies did not include strong state-of-the-art (SOTA) classifiers as baselines. In this work, we are interested in understanding whether balancing is beneficial, particularly in the context of SOTA classifiers. Thus, we conduct extensive experiments considering three SOTA classifiers alongside the weaker learners used in previous investigations. Additionally, we carefully discern proper metrics, consistent and non-consistent algorithms, and hyper-parameter selection methods, and show that these have a significant impact on prediction quality and on the effectiveness of balancing. Our results support the known utility of balancing for weak classifiers. However, we find that balancing does not improve prediction performance for the strong ones. We further identify several other scenarios for which balancing is effective and observe that prior studies demonstrated the utility of balancing by focusing on these settings.
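
The abstract contrasts training on the raw imbalanced data with training after minority oversampling such as SMOTE. The sketch below shows that comparison in minimal form, assuming scikit-learn and imbalanced-learn are installed; the synthetic dataset, the HistGradientBoostingClassifier used as a stand-in for a strong learner, and the AUC metric are illustrative choices, not the paper's exact experimental setup.

```python
# Minimal sketch: train a strong gradient-boosting classifier on the original
# imbalanced training data and on a SMOTE-rebalanced copy, then compare AUC.
# Assumes scikit-learn and imbalanced-learn are available; dataset and
# hyper-parameters are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Imbalanced binary tabular data (~5% minority class).
X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0)

# Baseline: fit directly on the imbalanced training set.
clf_plain = HistGradientBoostingClassifier(random_state=0).fit(X_train, y_train)
auc_plain = roc_auc_score(y_test, clf_plain.predict_proba(X_test)[:, 1])

# SMOTE: oversample the minority class on the training split only, then fit.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
clf_smote = HistGradientBoostingClassifier(random_state=0).fit(X_res, y_res)
auc_smote = roc_auc_score(y_test, clf_smote.predict_proba(X_test)[:, 1])

print(f"AUC without balancing: {auc_plain:.4f}")
print(f"AUC with SMOTE:        {auc_smote:.4f}")
```

In line with the paper's finding, a strong boosted-tree model will often score about the same with and without SMOTE on probability-based metrics such as AUC, whereas weaker learners tend to benefit more from balancing.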
