Paper title
ECOD: Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions
Authors
Abstract
Outlier detection refers to the identification of data points that deviate from a general data distribution. Existing unsupervised approaches often suffer from high computational cost, complex hyperparameter tuning, and limited interpretability, especially when working with large, high-dimensional datasets. To address these issues, we present a simple yet effective algorithm called ECOD (Empirical-Cumulative-distribution-based Outlier Detection), which is inspired by the fact that outliers are often the "rare events" that appear in the tails of a distribution. In a nutshell, ECOD first estimates the underlying distribution of the input data in a nonparametric fashion by computing the empirical cumulative distribution per dimension of the data. ECOD then uses these empirical distributions to estimate tail probabilities per dimension for each data point. Finally, ECOD computes an outlier score of each data point by aggregating estimated tail probabilities across dimensions. Our contributions are as follows: (1) we propose a novel outlier detection method called ECOD, which is both parameter-free and easy to interpret; (2) we perform extensive experiments on 30 benchmark datasets, where we find that ECOD outperforms 11 state-of-the-art baselines in terms of accuracy, efficiency, and scalability; and (3) we release an easy-to-use and scalable (with distributed support) Python implementation for accessibility and reproducibility.
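The three steps described in the abstract (per-dimension empirical CDFs, per-dimension tail probabilities, aggregation across dimensions) can be illustrated with a minimal NumPy sketch. This is not the authors' released implementation. The function names `ecdf_tail_probs` and `outlier_scores` are hypothetical, and the aggregation rule used here (sum the negative log tail probabilities over dimensions, then take the larger of the left-tail and right-tail totals) is an assumption, since the abstract does not specify how the probabilities are combined.

```python
import numpy as np


def ecdf_tail_probs(X):
    """Per-dimension empirical left and right tail probabilities.

    left[i, j]  = fraction of samples whose j-th value is <= X[i, j]
    right[i, j] = fraction of samples whose j-th value is >= X[i, j]
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    left = np.empty((n, d))
    right = np.empty((n, d))
    for j in range(d):
        col = np.sort(X[:, j])
        # side="right" counts the values <= x; side="left" counts the values < x
        left[:, j] = np.searchsorted(col, X[:, j], side="right") / n
        right[:, j] = (n - np.searchsorted(col, X[:, j], side="left")) / n
    return left, right


def outlier_scores(X):
    """Aggregate tail probabilities into one score per data point.

    Assumption (not stated in the abstract): sum -log(p) over dimensions
    for each tail, then keep the larger of the two totals.
    """
    left, right = ecdf_tail_probs(X)
    # Every point counts itself, so each probability is at least 1/n
    # and the logarithm is always finite.
    o_left = -np.log(left).sum(axis=1)
    o_right = -np.log(right).sum(axis=1)
    return np.maximum(o_left, o_right)


# Small usage example: the last row is extreme in both dimensions.
X = np.array([[0, 5], [1, 4], [2, 3], [3, 2], [4, 1], [9, 9]], dtype=float)
scores = outlier_scores(X)
print(scores)
```

Because the sketch depends only on ranks within each dimension, it has no hyperparameters to tune, and each per-dimension term shows how much that dimension contributes to a point's score. This is one way to read the abstract's claims of being parameter-free and easy to interpret.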