Indel通道的接近线性时间编辑距离

论文标题

Indel通道的接近线性时间编辑距离

Near-Linear Time Edit Distance for Indel Channels

论文作者

Ganesh, Arun, Sy, Aaron

论文摘要

我们考虑以下用于采样字符串对的模型：$ s_1 $是长度$ n $的均匀随机斑点，而$ s_2 $是通过对$ s_1 $的每一点施加替换，插入和删除来实现的bittring。我们表明，$ s_1 $和$ s_2 $之间的编辑距离可以以$ o（n \ ln n）$时间计算，只要$ s_1 $的每一点都在其上应用一个突变，最多只有一个小常数。该算法很简单，仅使用教科书动态编程算法作为原始，首先计算两个字符串之间的近似对齐，然后运行动态编程算法，仅限于接近近似对齐的条目。我们算法的分析为实践中使用的对齐启发式方法（例如BLAST，FASTA和MAFFT）提供了理论上的理由，这些启发式方法也从快速计算近似对准开始，然后在近似对齐附近找到最佳对准。我们的主要技术贡献是对对齐的分区，以使分区中的子集数量不大，并且一个子集中的每个对齐都比我们的算法与可能性很高的对齐方式更差。在通常通过动态编程解决的其他问题的平均情况分析中，类似的技术可能会引起人们的关注。

We consider the following model for sampling pairs of strings: $s_1$ is a uniformly random bitstring of length $n$, and $s_2$ is the bitstring arrived at by applying substitutions, insertions, and deletions to each bit of $s_1$ with some probability. We show that the edit distance between $s_1$ and $s_2$ can be computed in $O(n \ln n)$ time with high probability, as long as each bit of $s_1$ has a mutation applied to it with probability at most a small constant. The algorithm is simple and only uses the textbook dynamic programming algorithm as a primitive, first computing an approximate alignment between the two strings, and then running the dynamic programming algorithm restricted to entries close to the approximate alignment. The analysis of our algorithm provides theoretical justification for alignment heuristics used in practice such as BLAST, FASTA, and MAFFT, which also start by computing approximate alignments quickly and then find the best alignment near the approximate alignment. Our main technical contribution is a partitioning of alignments such that the number of the subsets in the partition is not too large and every alignment in one subset is worse than an alignment considered by our algorithm with high probability. Similar techniques may be of interest in the average-case analysis of other problems commonly solved via dynamic programming.

下载PDF全文

下载文献需遵守相关版权规定

论文标题