Paper Title

A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework

Authors

Gabriel Aguiar, Bartosz Krawczyk, Alberto Cano

Abstract

Class imbalance poses new challenges when it comes to classifying data streams. Many algorithms recently proposed in the literature tackle this problem using a variety of data-level, algorithm-level, and ensemble approaches. However, there is a lack of standardized and agreed-upon procedures and benchmarks on how to evaluate these algorithms. This work proposes a standardized, exhaustive, and comprehensive experimental framework to evaluate algorithms in a collection of diverse and challenging imbalanced data stream scenarios. The experimental study evaluates 24 state-of-the-art data streams algorithms on 515 imbalanced data streams that combine static and dynamic class imbalance ratios, instance-level difficulties, concept drift, real-world and semi-synthetic datasets in binary and multi-class scenarios. This leads to a large-scale experimental study comparing state-of-the-art classifiers in the data stream mining domain. We discuss the advantages and disadvantages of state-of-the-art classifiers in each of these scenarios and we provide general recommendations to end-users for selecting the best algorithms for imbalanced data streams. Additionally, we formulate open challenges and future directions for this domain. Our experimental framework is fully reproducible and easy to extend with new methods. This way, we propose a standardized approach to conducting experiments in imbalanced data streams that can be used by other researchers to create complete, trustworthy, and fair evaluation of newly proposed methods. Our experimental framework can be downloaded from https://github.com/canoalberto/imbalanced-streams.