Paper Title
DiviK: Divisive intelligent K-Means for hands-free unsupervised clustering in big biological data
Authors
Abstract
Investigating molecular heterogeneity provides insights into tumor origin and metabolomics. The increasing amount of gathered data makes manual analysis infeasible; therefore, automated unsupervised learning approaches are used to discover heterogeneity. However, automated unsupervised analyses require considerable experience in setting their hyperparameters and usually upfront knowledge of the number of expected substructures. Moreover, the large number of measured molecules requires an additional feature engineering step to provide valuable results. In this work, we propose DiviK: a scalable stepwise algorithm with local data-driven feature space adaptation for the segmentation of high-dimensional datasets. A combination of three quality indices (Dice Index, Rand Index, and EXIMS score) is used to assess the quality of unsupervised analyses in 3D space. DiviK was validated on two separate high-throughput datasets acquired by Mass Spectrometry Imaging in 2D and 3D. DiviK could be one of the default choices to consider during the initial exploration of Mass Spectrometry Imaging data. It provides a trade-off between absolute heterogeneity detection and a focus on biologically plausible structures, and it does not require specifying the number of expected structures before the analysis. With its unique local feature space adaptation, it is robust against dominating global patterns when focusing on detail. Finally, due to its simplicity, DiviK is easily generalizable to an even more flexible framework, useful for other '-omics' data or tabular data in general (including medical images after appropriate embedding). A generic implementation is freely available under the Apache 2.0 license at https://github.com/gmrukwa/divik.
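Two of the quality indices named in the abstract have standard textbook definitions, and a small sketch may help readers who have not met them. The code below is only an illustration under stated assumptions: the function names `dice_index` and `rand_index` are hypothetical, regions are represented as sets of pixel or voxel identifiers, and nothing here is taken from the DiviK package itself. The EXIMS score is left out because the abstract does not define it.

```python
from itertools import combinations


def dice_index(region_a, region_b):
    """Dice index of two regions: 2 * |A & B| / (|A| + |B|).

    Regions are given as iterables of pixel/voxel identifiers
    (an assumption made for this sketch).
    """
    a, b = set(region_a), set(region_b)
    if not a and not b:
        # Convention assumed here: two empty regions agree perfectly.
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def rand_index(labels_true, labels_pred):
    """Rand index: fraction of sample pairs on which two labelings agree.

    A pair counts as an agreement when both labelings put the two samples
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        # Fewer than two samples: no pairs to disagree on.
        return 1.0
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)
```

The pairwise loop in `rand_index` is quadratic in the number of samples, which is fine for illustration but too slow for full imaging datasets; a contingency-table formulation would be the usual choice at scale. Note also that the Rand index is invariant to relabeling, so `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` evaluates to 1.0.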