论文标题
调整调整后的兰德指数 - 多项式故事
Adjusting the adjusted Rand Index -- A multinomial story
论文作者
论文摘要
调整后的兰德指数($ ari $)可以说是群集比较的最受欢迎的措施之一。 $ ari $的调整是基于一个超几何分布假设,从建模的角度来看,它不令人满意,因为(i)当两个簇依赖时,它是不合适的,(ii)迫使簇的大小,(iii)它忽略了采样的随机性。在这项工作中,我们提出了RAND索引的新“修改”版本。首先,我们仅通过相似性来计算对成对的一致并通过差异一致,从而提高了分数的解释性,从而重新定义了$ MRI $。其次,我们将调整后的版本($ mari $)基于多项式分布,而不是高几何分布。多项式模型是有利的,因为它不迫使簇的大小,正确模型随机性,并且很容易扩展到依赖情况。我们表明,$ ari $在多项式型号下是有偏见的,并且对于小$ n $而言,$ ari $和$ mari $之间的差异可能很大,但对于大$ n $而言,$ n $是个人的数量。最后,我们提供了一种有效的算法来计算所有这些数量($(a)ri $和$ m(a)ri $),通过依靠我们的\ texttt {aricode} package in contingency表的稀疏表示。空间和时间复杂性在样本数量中是线性的,重要的是,由于我们没有明确计算偶性表,因此不取决于群集的数量。
The Adjusted Rand Index ($ARI$) is arguably one of the most popular measures for cluster comparison. The adjustment of the $ARI$ is based on a hypergeometric distribution assumption which is unsatisfying from a modeling perspective as (i) it is not appropriate when the two clusterings are dependent, (ii) it forces the size of the clusters, and (iii) it ignores randomness of the sampling. In this work, we present a new "modified" version of the Rand Index. First, we redefine the $MRI$ by only counting the pairs consistent by similarity and ignoring the pairs consistent by difference, increasing the interpretability of the score. Second, we base the adjusted version, $MARI$, on a multinomial distribution instead of a hypergeometric distribution. The multinomial model is advantageous as it does not force the size of the clusters, properly models randomness, and is easily extended to the dependant case. We show that the $ARI$ is biased under the multinomial model and that the difference between the $ARI$ and $MARI$ can be large for small $n$ but essentially vanish for large $n$, where $n$ is the number of individuals. Finally, we provide an efficient algorithm to compute all these quantities ($(A)RI$ and $M(A)RI$) by relying on a sparse representation of the contingency table in our \texttt{aricode} package. The space and time complexity is linear in the number of samples and importantly does not depend on the number of clusters as we do not explicitly compute the contingency table.