Cross-Study Replicability in Cluster Analysis

Paper Title

Cross-Study Replicability in Cluster Analysis

Authors

Lorenzo Masoero, Emma Thomas, Giovanni Parmigiani, Svitlana Tyekucheva, Lorenzo Trippa

Abstract

In cancer research, clustering techniques are widely used for exploratory analyses and dimensionality reduction, playing a critical role in the identification of novel cancer subtypes, often with direct implications for patient management. As data collected by multiple research groups grows, it is increasingly feasible to investigate the replicability of clustering procedures, that is, their ability to consistently recover biologically meaningful clusters across several datasets. In this paper, we review existing methods to assess replicability of clustering analyses, and discuss a framework for evaluating cross-study clustering replicability, useful when two or more studies are available. These approaches can be applied to any clustering algorithm and can employ different measures of similarity between partitions to quantify replicability, globally (i.e. for the whole sample) as well as locally (i.e. for individual clusters). Using experiments on synthetic and real gene expression data, we illustrate the utility of replicability metrics to evaluate if the same clusters are identified consistently across a collection of datasets.
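The abstract mentions quantifying replicability with a similarity measure between partitions, both globally and per cluster. The sketch below is one minimal way to illustrate that idea, not the authors' implementation. It assumes two label vectors over the same samples, for example a study's own clustering and the labels obtained by applying a clustering learned on another study. The adjusted Rand index (global) and best-match Jaccard overlap (local) are illustrative choices made here, and the function names `adjusted_rand_index` and `per_cluster_jaccard` are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Global agreement between two partitions of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same samples")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: agreement is trivial
    # Pairs of samples co-clustered in both partitions, in A, and in B.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = in_a * in_b / total_pairs  # agreement expected by chance
    maximum = (in_a + in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are trivial
    return (both - expected) / (maximum - expected)


def per_cluster_jaccard(labels_a, labels_b):
    """Local agreement: best Jaccard overlap of each cluster in A with any cluster in B."""
    members_a, members_b = {}, {}
    for i, (a, b) in enumerate(zip(labels_a, labels_b)):
        members_a.setdefault(a, set()).add(i)
        members_b.setdefault(b, set()).add(i)
    return {
        a: max(len(sa & sb) / len(sa | sb) for sb in members_b.values())
        for a, sa in members_a.items()
    }


# Toy labels (made up for illustration, not data from the paper).
own_labels = [0, 0, 0, 1, 1, 1]          # clustering fitted on the study itself
transferred_labels = [0, 0, 1, 1, 1, 1]  # labels carried over from another study

global_score = adjusted_rand_index(own_labels, transferred_labels)
local_scores = per_cluster_jaccard(own_labels, transferred_labels)
print(global_score, local_scores)
```

A global score near 1 suggests the two partitions agree up to relabeling, while the per-cluster scores indicate which individual clusters are recovered well and which are not.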
