Paper Title

Parallel bi-objective evolutionary algorithms for scalable feature subset selection via migration strategy under Spark

Authors

Yelleti Vivek, Vadlamani Ravi, P. Radha Krishna

Abstract

Feature subset selection (FSS) for classification is inherently a bi-objective optimization problem: the task is to find a feature subset that yields the maximum possible area under the receiver operating characteristic curve (AUC) with the minimum cardinality. Today, a huge amount of data is generated across all human activities. Mining such voluminous, often high-dimensional data requires parallel and scalable frameworks. In this first-of-its-kind study, we propose and develop an iterative MapReduce-based framework for wrappers based on bi-objective evolutionary algorithms (EAs) under Apache Spark with a migration strategy. To accomplish this, we parallelized two algorithms based on non-dominated sorting, namely the non-dominated sorting genetic algorithm II (NSGA-II) and non-dominated sorting particle swarm optimization (NSPSO), as well as a decomposition-based algorithm, namely the multi-objective evolutionary algorithm based on decomposition (MOEA-D), and named them P-NSGA-II-IS, P-NSPSO-IS, and P-MOEA-D-IS, respectively. We also propose a modified MOEA-D that incorporates the non-dominated sorting principle while parallelizing it. Throughout the study, AUC is computed by logistic regression (LR). We test the effectiveness of the proposed methodology on various datasets. Notably, P-NSGA-II proves statistically significant, ranking in the top two positions on most datasets. We also report empirical attainment plots, speedup analysis, the mean AUC obtained by the most repeated feature subset and by the least cardinal feature subset with the highest AUC, and diversity analysis using hypervolume.
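
The bi-objective formulation in the abstract (maximize AUC, minimize subset cardinality) rests on Pareto dominance, which the non-dominated sorting algorithms use to rank candidates. The following is a minimal Python sketch of that idea only, not the authors' Spark implementation. The names `dominates` and `non_dominated` and the sample (AUC, cardinality) pairs are illustrative assumptions, not taken from the paper.

```python
from typing import List, Tuple

# Each candidate is (auc, cardinality): AUC is maximized, cardinality is minimized.
Candidate = Tuple[float, int]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def non_dominated(cands: List[Candidate]) -> List[Candidate]:
    """Return the candidates that no other candidate dominates (first Pareto front)."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


# Hypothetical (AUC, subset size) pairs, made up for illustration only.
candidates = [(0.90, 12), (0.88, 5), (0.90, 8), (0.85, 5), (0.80, 3)]
front = non_dominated(candidates)
print(front)
```

In this toy input, (0.90, 12) is dropped because (0.90, 8) reaches the same AUC with fewer features, and (0.85, 5) is dropped because (0.88, 5) reaches a higher AUC at the same size. The remaining points are mutually incomparable trade-offs.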
