A classification performance evaluation measure considering data separability

Paper title

A classification performance evaluation measure considering data separability

Authors

Xue, Lingyan; Zhang, Xinyu; Jiang, Weidong; Huo, Kai

Abstract

Machine learning and deep learning classification models are data-driven: the model and the data jointly determine classification performance. Evaluating a model solely by classifier accuracy while ignoring data separability is biased, because excellent accuracy may simply reflect testing on highly separable data. Most existing data separability measures are defined in terms of distances between sample points, an approach that has been shown to fail in several circumstances. In this paper, we propose a new separability measure, the rate of separability (RS), which is based on the data coding rate. We validate its effectiveness as a supplement to existing separability measures by comparing it with four distance-based measures on synthetic datasets. We then demonstrate a positive correlation between the proposed measure and recognition accuracy in a multi-task scenario constructed from a real dataset. Finally, we discuss methods for evaluating the classification performance of machine learning and deep learning models while taking data separability into account.
