The Parallelism Motifs of Genomic Data Analysis

Paper Title

The Parallelism Motifs of Genomic Data Analysis

Authors

Yelick, Katherine; Buluc, Aydin; Awan, Muaaz; Azad, Ariful; Brock, Benjamin; Egan, Rob; Ekanayake, Saliya; Ellis, Marquita; Georganas, Evangelos; Guidi, Giulia; Hofmeyr, Steven; Selvitopi, Oguz; Teodoropol, Cristina; Oliker, Leonid

Abstract

Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share this data with the research community, but some of these genomic data analysis problems require large scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high end parallel systems today and place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high performance genomics analysis, including alignment, profiling, clustering, and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or motifs that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing.
