更快的近似模式匹配：统一方法

论文标题

更快的近似模式匹配：统一方法

Faster Approximate Pattern Matching: A Unified Approach

论文作者

Charalampopoulos, Panagiotis, Kociumaka, Tomasz, Wellnitz, Philip

论文摘要

近似模式匹配是字符串上的自然且研究的问题：给定文本$ t $，图案$ p $和一个阈值$ k $，查找（启动位置）$ t $的所有子字符串，最多在$ k $的$ p $。我们考虑两个最基本的弦乐指标：锤距离和编辑距离。在锤子距离下，我们搜索最多与$ p $的$ k $不匹配的$ t $的子字符串，而在编辑距离下，我们搜索可以将$ t $的子字符串与最多$ k $ edits一起转换为$ p $。 $ t $中的$ p $的确切出现具有非常简单的结构：如果我们为简单起见$ | t | \ le 3 | p |/2 $和Trim $ t $，因此$ P $既是前缀又是$ t $的后缀，然后$ p $和$ t $都是常规的，有一个常见的时间。然而，Bringmann等人最近才证明了最高$ K $不匹配的发生结构的类似表征。 [SODA'19]：要么有$ O（K^2）$ $ K $ - MISMATCH $ t $中的$ p $，要么$ p $ t $，或者$ p $和$ t $均为hamming距离$ o（k）$（k）$，带有普通周期$ o（m/k）$。我们通过证明在模式不（大约）周期性的情况下出现$ o（k）$ $ $ k $ - miSMATCH的情况，并将其提升到编辑距离设置，在非生育情况下，我们将$ o（k^2）$紧密地限制了$ o（k）$ $ $ k $ - $ miSMISMATCH的情况。我们的证明是建设性的，让我们获得一个统一的框架，以供两个所考虑的距离进行近似模式匹配。我们展示了框架的一般性，并显示了完全压缩的设置（其中$ t $和$ p $作为直线程序）和动态设置（我们扩展了Gawrychowski等人的数据结构[Soda'18]）。

Approximate pattern matching is a natural and well-studied problem on strings: Given a text $T$, a pattern $P$, and a threshold $k$, find (the starting positions of) all substrings of $T$ that are at distance at most $k$ from $P$. We consider the two most fundamental string metrics: the Hamming distance and the edit distance. Under the Hamming distance, we search for substrings of $T$ that have at most $k$ mismatches with $P$, while under the edit distance, we search for substrings of $T$ that can be transformed to $P$ with at most $k$ edits. Exact occurrences of $P$ in $T$ have a very simple structure: If we assume for simplicity that $|T| \le 3|P|/2$ and trim $T$ so that $P$ occurs both as a prefix and as a suffix of $T$, then both $P$ and $T$ are periodic with a common period. However, an analogous characterization for the structure of occurrences with up to $k$ mismatches was proved only recently by Bringmann et al. [SODA'19]: Either there are $O(k^2)$ $k$-mismatch occurrences of $P$ in $T$, or both $P$ and $T$ are at Hamming distance $O(k)$ from strings with a common period $O(m/k)$. We tighten this characterization by showing that there are $O(k)$ $k$-mismatch occurrences in the case when the pattern is not (approximately) periodic, and we lift it to the edit distance setting, where we tightly bound the number of $k$-edit occurrences by $O(k^2)$ in the non-periodic case. Our proofs are constructive and let us obtain a unified framework for approximate pattern matching for both considered distances. We showcase the generality of our framework with results for the fully-compressed setting (where $T$ and $P$ are given as a straight-line program) and for the dynamic setting (where we extend a data structure of Gawrychowski et al. [SODA'18]).

下载PDF全文

下载文献需遵守相关版权规定

论文标题