论文标题
用于基于质谱的OMICS数据的分布式内存计算的通信较低。
Communication Lower-Bounds for Distributed-Memory Computations for Mass Spectrometry based Omics Data
论文作者
论文摘要
基于质谱(MS)的OMIC数据分析需要大量的时间和资源。迄今为止,很少有人提出平行算法来从基于质谱的数据中推导肽。但是,这些平行算法的设计并开发了,当需要处理的数据量较小时。 In this paper, we prove that the communication bound that is reached by the \emph{existing} parallel algorithms is $Ω(mn+2r\frac{q}{p})$, where $m$ and $n$ are the dimensions of the theoretical database matrix, $q$ and $r$ are dimensions of spectra, and $p$ is the number of processors.我们进一步证明了使用快速启示$ \ sqrt {m} = mn + \ frac {2qr} {p} $可以实现$ω({\ frac {2mnq} {p}})$可以实现$ω({\ frac {2mnq}})$可以实现任何现有的平行蛋白质组algorith tolgormith tollgormith tolgt ilgormits tollgormith,我们还可以实现通信 - 最佳策略。为了验证我们的主张,我们对已发表的平行算法及其性能结果进行了荟萃分析。我们表明,随着处理器数量越来越多的次优速度是无法实现较低范围的直接结果。我们通过执行实验进一步验证我们的主张,这些实验证明了本文证明的沟通范围。因此,我们断言\ emph {provable}的下一代,并且在基于MS的大型系统生物学研究中迫切需要表现出优质的平行算法,尤其是用于元蛋白质组学,蛋白质组学,微生物组和蛋白质组学的非型生物体。我们的希望是,本文将激发并行的计算社区,以进一步研究基于MS的高度影响的OMICS问题的并行算法。
Mass spectrometry (MS) based omics data analysis require significant time and resources. To date, few parallel algorithms have been proposed for deducing peptides from mass spectrometry-based data. However, these parallel algorithms were designed, and developed when the amount of data that needed to be processed was smaller in scale. In this paper, we prove that the communication bound that is reached by the \emph{existing} parallel algorithms is $Ω(mn+2r\frac{q}{p})$, where $m$ and $n$ are the dimensions of the theoretical database matrix, $q$ and $r$ are dimensions of spectra, and $p$ is the number of processors. We further prove that communication-optimal strategy with fast-memory $\sqrt{M} = mn + \frac{2qr}{p}$ can achieve $Ω({\frac{2mnq}{p}})$ but is not achieved by any existing parallel proteomics algorithms till date. To validate our claim, we performed a meta-analysis of published parallel algorithms, and their performance results. We show that sub-optimal speedups with increasing number of processors is a direct consequence of not achieving the communication lower-bounds. We further validate our claim by performing experiments which demonstrate the communication bounds that are proved in this paper. Consequently, we assert that next-generation of \emph{provable}, and demonstrated superior parallel algorithms are urgently needed for MS based large systems-biology studies especially for meta-proteomics, proteogenomic, microbiome, and proteomics for non-model organisms. Our hope is that this paper will excite the parallel computing community to further investigate parallel algorithms for highly influential MS based omics problems.