Title
Partial Resampling of Imbalanced Data
Authors
Abstract
Imbalanced data is a frequently encountered problem in machine learning. Despite the vast literature on sampling techniques for imbalanced data, only a limited number of studies address the question of the optimal sampling ratio. In this paper, we attempt to fill this gap by conducting a large-scale study of the effect of the sampling ratio on classification accuracy. We consider 10 popular sampling methods and evaluate their performance over a range of ratios on 20 datasets. The results of the numerical experiments suggest that the optimal sampling ratio lies between 0.7 and 0.8, although the exact ratio varies with the dataset. Furthermore, we find that while factors such as the original imbalance ratio or the number of features do not play a discernible role in determining the optimal ratio, the number of samples in the dataset may have a tangible effect.
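
To make the notion of a sampling ratio concrete, the following is a minimal sketch of partial random oversampling, not the paper's implementation. It assumes the ratio is defined as the minority count divided by the majority count after resampling, which the abstract does not state explicitly. The function name `partial_oversample` and its parameters are hypothetical, and random duplication is only a stand-in for the 10 sampling methods studied, which are not named in the abstract.

```python
import random
from collections import Counter


def partial_oversample(X, y, ratio=0.75, seed=0):
    """Duplicate random minority samples until n_minority / n_majority ~ ratio.

    Assumption: `ratio` means minority count over majority count after
    resampling. ratio=1.0 gives full balancing; smaller values give
    partial resampling. Binary labels only.
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    counts = Counter(y)
    if len(counts) != 2:
        raise ValueError("this sketch handles binary labels only")

    # most_common sorts by descending count: majority first, minority second.
    (maj_label, n_maj), (min_label, n_min) = counts.most_common(2)

    # Number of minority samples needed to reach the target ratio.
    target = int(ratio * n_maj)
    n_extra = max(0, target - n_min)

    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == min_label]
    extra_idx = [rng.choice(minority_idx) for _ in range(n_extra)]

    # Original data is kept intact; sampled duplicates are appended.
    X_res = list(X) + [X[i] for i in extra_idx]
    y_res = list(y) + [y[i] for i in extra_idx]
    return X_res, y_res


# Toy data: 90 majority samples (label 0) and 10 minority samples (label 1).
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10
X_res, y_res = partial_oversample(X, y, ratio=0.75)
print(Counter(y_res))  # 90 majority and 67 minority samples
```

Under this assumed definition, a ratio of 0.75 on the toy data raises the minority class from 10 to 67 samples rather than to the full 90, which is the sense in which the resampling is partial.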