Paper Title
An Empirical Study on the Effectiveness of Data Resampling Approaches for Cross-Project Software Defect Prediction
Paper Authors
Abstract
Cross-project defect prediction (CPDP), where data from different software projects are used to predict defects, has been proposed as a way to provide data for software projects that lack historical data. Evaluations of CPDP models using the Nearest Neighbour (NN) Filter approach have shown promising results in recent studies. A key challenge with defect prediction datasets is class imbalance, that is, highly skewed datasets in which non-buggy modules dominate the buggy modules. In the past, data resampling approaches have been applied to within-project defect prediction models to help alleviate the negative effects of class imbalance in the datasets. To address the class imbalance issue in CPDP, the authors assess the impact of data resampling approaches on CPDP models after the NN Filter is applied. The impact on prediction performance of five oversampling approaches (MAHAKIL, SMOTE, Borderline-SMOTE, Random Oversampling, and ADASYN) and three undersampling approaches (Random Undersampling, Tomek Links, and One-Sided Selection) is investigated, and the results are compared to approaches without data resampling. The authors examined six defect prediction models on 34 datasets extracted from the PROMISE repository. The authors' results show that data resampling has a significant positive effect on CPDP performance, suggesting that software quality teams and researchers should consider applying data resampling approaches to improve recall (pd) and g-measure prediction performance. However, if the goal is to improve precision and reduce false alarms (pf), then data resampling approaches should be avoided.
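The pipeline the abstract describes (filter cross-project training data with the NN Filter, resample it, then score predictions by pd, pf, and g-measure) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the neighbour count `k`, the use of Euclidean distance, the choice of Random Oversampling as the resampling step, and the definition of g-measure as the harmonic mean of pd and (1 - pf) are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def nn_filter(source, target, k=10):
    """Keep, for every target row, its k nearest source rows (Euclidean distance).

    source: list of (features, label) pairs from other projects.
    target: list of feature tuples from the project under prediction.
    k is an assumed parameter; the abstract does not state its value.
    """
    selected = set()
    for t in target:
        ranked = sorted(range(len(source)),
                        key=lambda i: math.dist(source[i][0], t))
        selected.update(ranked[:k])
    return [source[i] for i in sorted(selected)]


def random_oversample(rows, seed=0):
    """Duplicate randomly chosen minority-class rows until the classes are equal in size."""
    rng = random.Random(seed)
    buggy = [r for r in rows if r[1] == 1]
    clean = [r for r in rows if r[1] == 0]
    minority, majority = (buggy, clean) if len(buggy) < len(clean) else (clean, buggy)
    if not minority:
        # Nothing to duplicate when one class is absent.
        return list(rows)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(rows) + extra


def pd_pf_g(y_true, y_pred):
    """Return (pd, pf, g-measure) for binary labels, where 1 means buggy.

    g-measure is assumed here to be the harmonic mean of pd and (1 - pf).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pd = tp / (tp + fn) if (tp + fn) else 0.0
    pf = fp / (fp + tn) if (fp + tn) else 0.0
    denom = pd + (1 - pf)
    g = 2 * pd * (1 - pf) / denom if denom else 0.0
    return pd, pf, g
```

In use, the output of `nn_filter` would be passed to `random_oversample`, and the balanced rows would then train whichever classifier is being evaluated. Resampling is applied to the training data only, so the target project's class distribution stays untouched when pd, pf, and g-measure are computed.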