论文标题

计算Phylo-K-Mers

Computing Phylo-k-mers

论文作者

Romashchenko, Nikolai, Linard, Benjamin, Pardi, Fabio, Rivals, Eric

论文摘要

系统发育知情的K-MER或phylo-k-Mers的简称为K-MER,这些K-MER被预测出现在给定的基因组区域内,在固定系统发育的预定位置。给定该基因组区域的参考比对并假设序列演化的系统发育模型,我们可以在任何给定的树节点下计算任何给定的K-MER的概率得分。具有足够高概率的K-MER可用于执行新序列的无比对系统发育分类 - 最近提出的用于元法编码读数的系统发育放置并检测新型病毒重组剂。在计算Phylo-K-Mers时,我们需要在每个树节点上考虑大量的K-Mers,这值得开发有效的枚举算法。我们考虑对Phylo-K-MER计算问题的正式定义:如何有效地找到所有概率位于给定树节点的用户定义阈值以上的K-MER?我们依靠分支机构和分裂技术来描述和分析此问题的算法。我们利用对齐方式的相邻窗口的冗余和概率矩阵的结构来保存计算。除了进行计算复杂性分析外,我们还对其在现实世界和模拟数据上实现的相对性能提供了经验评估。据我们所知,划分和诱导算法是对分支和结合方法的明显改进,尤其是当发现大量的phylo-k-Mers时。

Phylogenetically informed k-mers, or phylo-k-mers for short, are k-mers that are predicted to appear within a given genomic region at predefined locations of a fixed phylogeny. Given a reference alignment for this genomic region and assuming a phylogenetic model of sequence evolution, we can compute a probability score for any given k-mer at any given tree node. The k-mers with sufficiently high probabilities can later be used to perform alignment-free phylogenetic classification of new sequences-a procedure recently proposed for the phylogenetic placement of metabarcoding reads and the detection of novel virus recombinants. While computing phylo-k-mers, we need to consider large numbers of k-mers at each tree node, which warrants the development of efficient enumeration algorithms. We consider a formal definition of the problem of phylo-k-mer computation: How to efficiently find all k-mers whose probability lies above a user-defined threshold for a given tree node? We describe and analyze algorithms for this problem, relying on branch-and-bound and divideand-conquer techniques. We exploit the redundancy of adjacent windows of the alignment and the structure of the probability matrix to save on computation. Besides computational complexity analyses, we provide an empirical evaluation of the relative performance of their implementations on real-world and simulated data. The divide-and-conquer algorithms, which to the best of our knowledge are novel, are found to be clear improvements over the branch-and-bound approach, especially when a large number of phylo-k-mers are found.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源