Paper Title
A Max-relevance-min-divergence Criterion for Data Discretization with Applications on Naive Bayes
Paper Authors
Paper Abstract
In many classification models, data is discretized to better estimate its distribution. Existing discretization methods often aim to maximize the discriminant power of the discretized data, while overlooking the fact that the primary goal of data discretization in classification is to improve generalization performance. As a result, the data tend to be over-split into many small bins, since undiscretized data retain the maximal discriminant information. Thus, we propose a Max-Dependency-Min-Divergence (MDmD) criterion that maximizes both the discriminant information and the generalization ability of the discretized data. More specifically, the Max-Dependency criterion maximizes the statistical dependency between the discretized data and the classification variable, while the Min-Divergence criterion explicitly minimizes the JS-divergence between the training data and the validation data for a given discretization scheme. The proposed MDmD criterion is technically appealing, but it is difficult to reliably estimate the high-order joint distributions of the attributes and the classification variable. We hence further propose a more practical solution, the Max-Relevance-Min-Divergence (MRmD) discretization scheme, where each attribute is discretized separately by simultaneously maximizing the discriminant information and the generalization ability of the discretized data. The proposed MRmD is compared with state-of-the-art discretization algorithms under the naive Bayes classification framework on 45 machine-learning benchmark datasets. It significantly outperforms all the compared methods on most of the datasets.
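The abstract describes the MRmD criterion only at a high level: for each attribute, trade off the relevance of the discretized values to the class against the divergence between training and validation bin distributions. The sketch below is a minimal illustration of one plausible way to score a candidate discretization of a single attribute under such a criterion. The function names, the additive combination, and the trade-off weight `lam` are assumptions for illustration, not the paper's exact formulation or optimization procedure.

```python
# A minimal sketch of an MRmD-style score for one attribute, assuming a
# simple additive trade-off weighted by `lam`; the paper's exact
# formulation and search over cut points may differ.
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    return 0.5 * entropy(p, m) + 0.5 * entropy(q, m)


def mrmd_score(x_train, y_train, x_valid, cut_points, lam=1.0):
    """Score one candidate discretization (a list of cut points) of a single
    attribute: relevance (mutual information with the class) minus the
    JS-divergence between the training and validation bin distributions."""
    bins = np.concatenate(([-np.inf], np.sort(cut_points), [np.inf]))
    d_train = np.digitize(x_train, bins)          # discretized training values
    d_valid = np.digitize(x_valid, bins)          # discretized validation values

    # Relevance term: I(discretized attribute; class variable).
    relevance = mutual_info_score(d_train, y_train)

    # Divergence term: JS-divergence between train/validation bin histograms.
    n_bins = len(bins) - 1
    p_train = np.bincount(d_train - 1, minlength=n_bins) + 1e-12
    p_valid = np.bincount(d_valid - 1, minlength=n_bins) + 1e-12
    divergence = js_divergence(p_train, p_valid)

    return relevance - lam * divergence
```

Under this kind of score, adding ever more cut points keeps increasing the relevance term but eventually inflates the train/validation divergence, which is the mechanism the abstract describes for discouraging over-splitting into many small bins.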