Paper Title

On the Need of Removing Last Releases of Data When Using or Validating Defect Prediction Models

Authors

Aalok Ahluwalia, Massimiliano Di Penta, Davide Falessi

Abstract

To develop and train defect prediction models, researchers rely on datasets in which a defect is attributed to an artifact, e.g., a class of a given release. However, the creation of such datasets is far from perfect. A defect can be discovered several releases after its introduction: this phenomenon is known as a "dormant defect". This means that, if we observe the status of a class in its current version today, it may be considered defect-free even though this is not the case. We call "snoring" the noise consisting of such classes, affected by dormant defects only. We conjecture that the presence of snoring negatively impacts classifiers' accuracy and their evaluation. Moreover, more recent releases likely contain more snoring classes than older ones; thus, removing the most recent releases from a dataset could reduce the snoring effect and improve the accuracy of classifiers. In this paper we investigate the impact of the snoring noise on classifiers' accuracy and their evaluation, and the effectiveness of a possible countermeasure consisting of removing the last releases of data. We analyze the accuracy of 15 machine learning defect prediction classifiers on data from more than 4,000 bugs and 600 releases of 19 open source projects from the Apache ecosystem. Our results show that, on average across projects: (i) the presence of snoring decreases the recall of defect prediction classifiers; (ii) evaluations affected by snoring are likely unable to identify the best classifiers, and (iii) removing data from recent releases helps to significantly improve the accuracy of the classifiers. In summary, this paper provides insights on how to create a software defect dataset by mitigating the effect of snoring.
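The countermeasure the abstract describes, dropping the most recent releases from a dataset before training or evaluation, can be sketched as a simple filtering step. The following is a minimal, hypothetical illustration (the tuple layout and release numbering are assumptions, not the authors' actual tooling):

```python
# Hedged sketch of the paper's countermeasure: drop the k most recent
# releases from a defect dataset to mitigate "snoring" noise (classes
# whose defects are still dormant and thus mislabeled as defect-free).
# The data layout below is hypothetical, chosen only for illustration.

def drop_last_releases(rows, k):
    """Remove rows belonging to the k most recent releases.

    rows: iterable of (release, class_name, is_defective) tuples,
          where a larger release number means a more recent release.
    """
    releases = sorted({release for release, _, _ in rows})
    kept = set(releases[:-k]) if k > 0 else set(releases)
    return [row for row in rows if row[0] in kept]

# Example: release 3 is the most recent, so its labels are the most
# likely to be affected by dormant (not-yet-discovered) defects.
dataset = [
    (1, "Foo.java", True),
    (1, "Bar.java", False),
    (2, "Foo.java", False),
    (3, "Bar.java", False),  # potential snoring class
]
cleaned = drop_last_releases(dataset, k=1)  # keeps only releases 1 and 2
```

The choice of how many releases to drop is a trade-off: removing more releases reduces snoring but also shrinks the training data, which is exactly the tension the paper's experiments explore.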
