Paper title

Mondrian Forest for Data Stream Classification Under Memory Constraints

Authors

Martin Khannouz, Tristan Glatard

Abstract

Supervised learning algorithms generally assume the availability of enough memory to store their data model during the training and test phases. However, in the Internet of Things, this assumption is unrealistic when data comes in the form of infinite data streams, or when learning algorithms are deployed on devices with reduced amounts of memory. In this paper, we adapt the online Mondrian forest classification algorithm to work with memory constraints on data streams. In particular, we design five out-of-memory strategies to update Mondrian trees with new data points when the memory limit is reached. Moreover, we design trimming mechanisms to make Mondrian trees more robust to concept drifts under memory constraints. We evaluate our algorithms on a variety of real and simulated datasets, and we conclude with recommendations on their use in different situations: the Extend Node strategy appears as the best out-of-memory strategy in all configurations, whereas different trimming mechanisms should be adopted depending on whether a concept drift is expected. All our methods are implemented in the OrpailleCC open-source library and are ready to be used on embedded systems and connected objects.
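To make the idea of an out-of-memory update concrete, here is a minimal Python sketch of a tree with a fixed node budget. It is an illustration under stated assumptions, not the paper's algorithm and not the OrpailleCC API: the names `Node`, `BoundedTree`, and `max_nodes` are hypothetical, the split rule is deterministic (a real Mondrian tree samples split dimension, location, and time at random), and the fallback shown (extending the box and class counters of existing nodes instead of allocating new ones) is only one plausible reading of what such a strategy could do once the memory limit is reached.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    lower: List[float]                      # lower corner of the bounding box
    upper: List[float]                      # upper corner of the bounding box
    counts: Dict[int, int] = field(default_factory=dict)  # class label -> count
    split_dim: Optional[int] = None
    split_val: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def extend(self, x: List[float], y: int) -> None:
        """Grow the box to cover x and count its label; allocates nothing."""
        self.lower = [min(a, b) for a, b in zip(self.lower, x)]
        self.upper = [max(a, b) for a, b in zip(self.upper, x)]
        self.counts[y] = self.counts.get(y, 0) + 1


class BoundedTree:
    """Toy tree with a hard node budget (hypothetical; assumes max_nodes >= 1)."""

    def __init__(self, max_nodes: int) -> None:
        self.max_nodes = max_nodes
        self.n_nodes = 0
        self.root: Optional[Node] = None

    def _split(self, leaf: Node, x: List[float], y: int, dim: int) -> None:
        # The old leaf content and the new point become two children.
        old = Node(list(leaf.lower), list(leaf.upper), dict(leaf.counts))
        new = Node(list(x), list(x), {y: 1})
        if x[dim] > leaf.upper[dim]:
            leaf.split_val = (leaf.upper[dim] + x[dim]) / 2
            leaf.left, leaf.right = old, new
        else:
            leaf.split_val = (x[dim] + leaf.lower[dim]) / 2
            leaf.left, leaf.right = new, old
        leaf.split_dim = dim
        leaf.extend(x, y)
        self.n_nodes += 2

    def update(self, x: List[float], y: int) -> None:
        if self.root is None:
            self.root = Node(list(x), list(x), {y: 1})
            self.n_nodes = 1
            return
        node = self.root
        while node.left is not None:        # descend, updating internal nodes
            node.extend(x, y)
            node = node.left if x[node.split_dim] <= node.split_val else node.right
        # Distance by which x falls outside the leaf box, per dimension.
        outside = [max(xi - u, 0.0) + max(l - xi, 0.0)
                   for xi, l, u in zip(x, node.lower, node.upper)]
        dim = max(range(len(x)), key=outside.__getitem__)
        if outside[dim] > 0 and self.n_nodes + 2 <= self.max_nodes:
            self._split(node, x, y, dim)    # memory available: grow the tree
        else:
            node.extend(x, y)               # budget reached: update in place

    def predict(self, x: List[float]) -> int:
        node = self.root                    # assumes at least one update() call
        while node.left is not None:
            node = node.left if x[node.split_dim] <= node.split_val else node.right
        return max(node.counts, key=node.counts.get)
```

With `max_nodes=3`, the second training point triggers the only split the budget allows; every later point is absorbed by an existing leaf, so the model keeps learning while its node count stays constant.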
