论文标题

可索引弹性创始人图的线性时间构建

Linear Time Construction of Indexable Elastic Founder Graphs

论文作者

Rizzo, Nicola, Mäkinen, Veli

论文摘要

由于其在基因组应用中的重要性,最近对图上的模式匹配进行了广泛的研究。不幸的是,即使决定是否出现字符串作为图形的最简单问题,也可以在正交矢量假设下接受二次下限(Equi等人ICALP 2019,SOFSEM 2021)。为了避免这种瓶颈,研究已转向更具体的图形类别,例如从多个序列比对(MSA)诱导的。考虑分割$ \ MATHSF {MSA} [1..m,1..n] $ to $ b $ blocks $ \ mathsf {MSA} [1..m,1..j_1] $,$ \ MATHSF {MSASF {MSA} [1..M,J_1+1..j_2] $,$ \ ldots $,$ \ ldots $,$ \ ldots $,$, $ \ mathsf {msa} [1..m,j_ {b-1}+1..n] $。在去除间隙符号后,块的行中的不同字符串形成了弹性创始人图(EFG)的节点,其中边缘代表在MSA中观察到的原始连接。如果节点标签作为仅作为从同一块的节点开始的那些路径的前缀,则可以称为索引。 Equi等。 (ISAAC 2021)表明,此类EFG支持快速模式匹配,并给出了$ O(Mn \ log M)$ - 时间算法,用于以一种允许构建可索引的EFG的方式进行预处理MSA,以最大程度地构建块数量,并以$ o(n)$ o(n)$ o(n)$ o(n)$ o(n)$ o(n)的最大长度,以最大程度地减少blong的长度。使用后缀树并在树上解决新的祖先问题,我们将预处理提高到$ O(mn)$ time和$ o(n \ log \ log \ log \ log n)$ - 时间EFG构造到$ o(n)$时间,从而表明两种类型的可索取EFG可以在输入尺寸中构建。

Pattern matching on graphs has been widely studied lately due to its importance in genomics applications. Unfortunately, even the simplest problem of deciding if a string appears as a subpath of a graph admits a quadratic lower bound under the Orthogonal Vectors Hypothesis (Equi et al. ICALP 2019, SOFSEM 2021). To avoid this bottleneck, the research has shifted towards more specific graph classes, e.g. those induced from multiple sequence alignments (MSAs). Consider segmenting $\mathsf{MSA}[1..m,1..n]$ into $b$ blocks $\mathsf{MSA}[1..m,1..j_1]$, $\mathsf{MSA}[1..m,j_1+1..j_2]$, $\ldots$, $\mathsf{MSA}[1..m,j_{b-1}+1..n]$. The distinct strings in the rows of the blocks, after the removal of gap symbols, form the nodes of an elastic founder graph (EFG) where the edges represent the original connections observed in the MSA. An EFG is called indexable if a node label occurs as a prefix of only those paths that start from a node of the same block. Equi et al. (ISAAC 2021) showed that such EFGs support fast pattern matching and gave an $O(mn \log m)$-time algorithm for preprocessing the MSA in a way that allows the construction of indexable EFGs maximizing the number of blocks and, alternatively, minimizing the maximum length of a block, in $O(n)$ and $O(n \log\log n)$ time respectively. Using the suffix tree and solving a novel ancestor problem on trees, we improve the preprocessing to $O(mn)$ time and the $O(n \log \log n)$-time EFG construction to $O(n)$ time, thus showing that both types of indexable EFGs can be constructed in time linear in the input size.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源