Paper Title

Challenges in Benchmarking Stream Learning Algorithms with Real-world Data

Authors

Souza, Vinicius M. A., Reis, Denis M. dos, Maletzke, Andre G., Batista, Gustavo E. A. P. A.

Abstract

Streaming data are increasingly present in real-world applications such as sensor measurements, satellite data feed, stock market, and financial data. The main characteristics of these applications are the online arrival of data observations at high speed and the susceptibility to changes in the data distributions due to the dynamic nature of real environments. The data stream mining community still faces some primary challenges and difficulties related to the comparison and evaluation of new proposals, mainly due to the lack of publicly available non-stationary real-world datasets. The comparison of stream algorithms proposed in the literature is not an easy task, as authors do not always follow the same recommendations, experimental evaluation procedures, datasets, and assumptions. In this paper, we mitigate problems related to the choice of datasets in the experimental evaluation of stream classifiers and drift detectors. To that end, we propose a new public data repository for benchmarking stream algorithms with real-world data. This repository contains the most popular datasets from literature and new datasets related to a highly relevant public health problem that involves the recognition of disease vector insects using optical sensors. The main advantage of these new datasets is the prior knowledge of their characteristics and patterns of changes to evaluate new adaptive algorithm proposals adequately. We also present an in-depth discussion about the characteristics, reasons, and issues that lead to different types of changes in data distribution, as well as a critical review of common problems concerning the current benchmark datasets available in the literature.
