PHONI: Streamed Matching Statistics with Multi-Genome References

Paper title

PHONI: Streamed Matching Statistics with Multi-Genome References

Authors

Christina Boucher, Travis Gagie, Tomohiro I, Dominik Köppl, Ben Langmead, Giovanni Manzini, Gonzalo Navarro, Alejandro Pacheco, Massimiliano Rossi

Abstract

Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al. recently implemented, but it uses two passes over the patterns and buffers a pointer for each character during the first pass. In this paper, we simplify their solution and make it streaming, at the cost of slowing it down slightly. This means that, first, we can compute the matching statistics of several long patterns (such as whole human chromosomes) in parallel while still using a reasonable amount of RAM; second, we can compute matching statistics online with low latency and thus quickly recognize when a pattern becomes incompressible relative to the database.
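To make the term concrete, the sketch below computes matching statistics directly from their usual definition: for each position i of the pattern, the length of the longest prefix of pattern[i:] that occurs somewhere in the text, together with one position in the text where it occurs. This is a naive illustration on uncompressed strings and is not the compressed-index method of the paper; the function name `matching_statistics` and the (position, length) output layout are assumptions made for this example.

```python
def matching_statistics(text: str, pattern: str) -> list[tuple[int, int]]:
    """Naive matching statistics of `pattern` with respect to `text`.

    Returns one (position, length) pair per pattern position i, where
    `length` is the length of the longest prefix of pattern[i:] that occurs
    in `text`, and `position` is a 0-based start of one such occurrence
    (-1 when even the first character does not occur).
    Illustrative only: worst-case time grows much faster than linear.
    """
    result = []
    for i in range(len(pattern)):
        position, length = -1, 0
        # Extend the match one character at a time while it still occurs.
        while i + length < len(pattern):
            found = text.find(pattern[i:i + length + 1])
            if found == -1:
                break
            position = found
            length += 1
        result.append((position, length))
    return result


if __name__ == "__main__":
    for pair in matching_statistics("GATTACA", "TACAG"):
        print(pair)
```

A long run of short lengths in the output is the signal the abstract alludes to: the pattern shares little with the database at that point, so it compresses poorly relative to it.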
