论文标题

更快的近似模式匹配:统一方法

Faster Approximate Pattern Matching: A Unified Approach

论文作者

Charalampopoulos, Panagiotis, Kociumaka, Tomasz, Wellnitz, Philip

论文摘要

近似模式匹配是字符串上的自然且研究的问题:给定文本$ t $,图案$ p $和一个阈值$ k $,查找(启动位置)$ t $的所有子字符串,最多在$ k $的$ p $。我们考虑两个最基本的弦乐指标:锤距离和编辑距离。在锤子距离下,我们搜索最多与$ p $的$ k $不匹配的$ t $的子字符串,而在编辑距离下,我们搜索可以将$ t $的子字符串与最多$ k $ edits一起转换为$ p $。 $ t $中的$ p $的确切出现具有非常简单的结构:如果我们为简单起见$ | t | \ le 3 | p |/2 $和Trim $ t $,因此$ P $既是前缀又是$ t $的后缀,然后$ p $和$ t $都是常规的,有一个常见的时间。然而,Bringmann等人最近才证明了最高$ K $不匹配的发生结构的类似表征。 [SODA'19]:要么有$ O(K^2)$ $ K $ - MISMATCH $ t $中的$ p $,要么$ p $ t $,或者$ p $和$ t $均为hamming距离$ o(k)$(k)$,带有普通周期$ o(m/k)$。我们通过证明在模式不(大约)周期性的情况下出现$ o(k)$ $ $ k $ - miSMATCH的情况,并将其提升到编辑距离设置,在非生育情况下,我们将$ o(k^2)$紧密地限制了$ o(k)$ $ $ k $ - $ miSMISMATCH的情况。我们的证明是建设性的,让我们获得一个统一的框架,以供两个所考虑的距离进行近似模式匹配。我们展示了框架的一般性,并显示了完全压缩的设置(其中$ t $和$ p $作为直线程序)和动态设置(我们扩展了Gawrychowski等人的数据结构[Soda'18])。

Approximate pattern matching is a natural and well-studied problem on strings: Given a text $T$, a pattern $P$, and a threshold $k$, find (the starting positions of) all substrings of $T$ that are at distance at most $k$ from $P$. We consider the two most fundamental string metrics: the Hamming distance and the edit distance. Under the Hamming distance, we search for substrings of $T$ that have at most $k$ mismatches with $P$, while under the edit distance, we search for substrings of $T$ that can be transformed to $P$ with at most $k$ edits. Exact occurrences of $P$ in $T$ have a very simple structure: If we assume for simplicity that $|T| \le 3|P|/2$ and trim $T$ so that $P$ occurs both as a prefix and as a suffix of $T$, then both $P$ and $T$ are periodic with a common period. However, an analogous characterization for the structure of occurrences with up to $k$ mismatches was proved only recently by Bringmann et al. [SODA'19]: Either there are $O(k^2)$ $k$-mismatch occurrences of $P$ in $T$, or both $P$ and $T$ are at Hamming distance $O(k)$ from strings with a common period $O(m/k)$. We tighten this characterization by showing that there are $O(k)$ $k$-mismatch occurrences in the case when the pattern is not (approximately) periodic, and we lift it to the edit distance setting, where we tightly bound the number of $k$-edit occurrences by $O(k^2)$ in the non-periodic case. Our proofs are constructive and let us obtain a unified framework for approximate pattern matching for both considered distances. We showcase the generality of our framework with results for the fully-compressed setting (where $T$ and $P$ are given as a straight-line program) and for the dynamic setting (where we extend a data structure of Gawrychowski et al. [SODA'18]).

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源