Paper Title

Assessing Software Defection Prediction Performance: Why Using the Matthews Correlation Coefficient Matters

Paper Authors

Jingxiu Yao and Martin Shepperd

Paper Abstract

Context: There is considerable diversity in the range and design of computational experiments to assess classifiers for software defect prediction. This is particularly so, regarding the choice of classifier performance metrics. Unfortunately some widely used metrics are known to be biased, in particular F1. Objective: We want to understand the extent to which the widespread use of the F1 renders empirical results in software defect prediction unreliable. Method: We searched for defect prediction studies that report both F1 and the Matthews correlation coefficient (MCC). This enabled us to determine the proportion of results that are consistent between both metrics and the proportion that change. Results: Our systematic review identifies 8 studies comprising 4017 pairwise results. Of these results, the direction of the comparison changes in 23% of the cases when the unbiased MCC metric is employed. Conclusion: We find compelling reasons why the choice of classification performance metric matters, specifically the biased and misleading F1 metric should be deprecated.
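To make the contrast in the abstract concrete, the following is a minimal illustrative sketch (not taken from the paper). It computes F1 and MCC directly from the four confusion-matrix counts using their standard definitions, with hypothetical counts chosen to show that F1 is insensitive to the number of true negatives while MCC is not, which is one commonly cited reason F1 is regarded as biased for imbalanced defect data.

import math

def f1_from_counts(tp, fp, fn, tn):
    # F1 = 2*TP / (2*TP + FP + FN); note that TN never appears.
    return 2 * tp / (2 * tp + fp + fn)

def mcc_from_counts(tp, fp, fn, tn):
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical counts: the same TP/FP/FN observed on two test sets that
# differ only in the number of true negatives (i.e. in class imbalance).
balanced   = dict(tp=30, fp=20, fn=10, tn=40)
imbalanced = dict(tp=30, fp=20, fn=10, tn=940)

print(f1_from_counts(**balanced), f1_from_counts(**imbalanced))   # ~0.667 and ~0.667 (identical)
print(mcc_from_counts(**balanced), mcc_from_counts(**imbalanced)) # ~0.41 and ~0.66 (different)

On real defect data the two metrics can also rank competing classifiers in opposite directions; this is the kind of reversal the review quantifies, occurring in 23% of the 4017 pairwise results.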
