Paper Title

Typical Yet Unlikely and Normally Abnormal: The Intuition Behind High-Dimensional Statistics

Author

Vowels, Matthew J.

Abstract

Normality, in the colloquial sense, has historically been considered an aspirational trait, synonymous with ideality. The arithmetic average and, by extension, statistics including linear regression coefficients, have often been used to characterize normality, and are often used as a way to summarize samples and identify outliers. We provide intuition behind the behavior of such statistics in high dimensions, and demonstrate that even for datasets with a relatively low number of dimensions, data start to exhibit a number of peculiarities which become severe as the number of dimensions increases. Whilst our main goal is to familiarize researchers with these peculiarities, we also show that normality can be better characterized with 'typicality', an information-theoretic concept relating to entropy. An application of typicality to both synthetic and real-world data concerning political values reveals that in multi-dimensional space, to be 'normal' is actually to be atypical. We briefly explore the ramifications for outlier detection, demonstrating how typicality, in contrast with the popular Mahalanobis distance, represents a viable method for outlier detection.
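
The claim that the 'normal' point is atypical can be illustrated with a small simulation. The following is a minimal sketch, not the paper's own code: it assumes a standard Gaussian N(0, I_d) with d = 100, and the helper name `typicality_gap` is hypothetical. It measures how far a point's per-dimension negative log-density lies from the per-dimension differential entropy, which is one common way to express typicality.

```python
import numpy as np


def typicality_gap(x, d):
    """Per-dimension gap between -log p(x) under N(0, I_d) and the entropy.

    Hypothetical helper for illustration; a small gap means x is 'typical'.
    """
    # Negative log-density of a standard d-dimensional Gaussian at x.
    neg_log_p = 0.5 * d * np.log(2 * np.pi) + 0.5 * np.sum(x ** 2, axis=-1)
    # Differential entropy of N(0, I_d).
    entropy = 0.5 * d * np.log(2 * np.pi * np.e)
    return np.abs(neg_log_p - entropy) / d


rng = np.random.default_rng(0)
d = 100  # assumed dimensionality for this sketch
samples = rng.standard_normal((10_000, d))

# With zero mean and identity covariance, the Mahalanobis distance to the
# mean reduces to the Euclidean norm.
norms = np.linalg.norm(samples, axis=1)

sample_gaps = typicality_gap(samples, d)
mean_gap = typicality_gap(np.zeros(d), d)  # the 'most normal' point

print("closest sample to the mean:", norms.min())
print("average distance to the mean:", norms.mean())
print("median typicality gap of samples:", np.median(sample_gaps))
print("typicality gap of the mean itself:", mean_gap)
```

Under these assumptions, the samples sit in a thin shell at a distance of roughly sqrt(d) from the mean, and none of them lands near the mean itself. The mean has a Mahalanobis distance of zero, so a distance-based rule treats it as the least anomalous point, yet its typicality gap is much larger than that of any sampled point.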
