重复文本集合的计算MEMS和亲戚

论文标题

重复文本集合的计算MEMS和亲戚

Computing MEMs and Relatives on Repetitive Text Collections

论文作者

Navarro, Gonzalo

论文摘要

我们考虑了计算一个给定模式$ p [1 .. m] $的最大精确匹配（mem）的问题，该$ p [1 .. m] $ $ t [1 .. n] $，该$表示为（希望较小）的（希望较小的）运行长度上下文的语法$ g_ {rl} $。我们表明，对于任何常数$ε> 0 $，在大小$ o（g_ {rl}）$的数据结构上，可以在时间$ o（m^2 \ log^εn）$中解决问题。此外，在大小$ o（δ\ log \ frac {n}δ）$的局部一致语法上，时间减少到$ o（m \ log m（\ log m + m + \ log^εn））$。值$δ$是$ t $和$ω（δ\ log \ frac {n}δ）$的基因复杂性的函数，这是重复文本$ t $的可压缩性的紧密下限，因此我们的结构在$ n $和$δ$方面具有最佳的大小。我们将结果扩展到几个相关的问题，例如查找$ K $ -MEMS，妈妈，稀有MEMS和应用程序。

We consider the problem of computing the Maximal Exact Matches (MEMs) of a given pattern $P[1 .. m]$ on a large repetitive text collection $T[1 .. n]$, which is represented as a (hopefully much smaller) run-length context-free grammar of size $g_{rl}$. We show that the problem can be solved in time $O(m^2 \log^εn)$, for any constant $ε> 0$, on a data structure of size $O(g_{rl})$. Further, on a locally consistent grammar of size $O(δ\log\frac{n}δ)$, the time decreases to $O(m\log m(\log m + \log^εn))$. The value $δ$ is a function of the substring complexity of $T$ and $Ω(δ\log\frac{n}δ)$ is a tight lower bound on the compressibility of repetitive texts $T$, so our structure has optimal size in terms of $n$ and $δ$. We extend our results to several related problems, such as finding $k$-MEMs, MUMs, rare MEMs, and applications.

下载PDF全文

下载文献需遵守相关版权规定

论文标题