论文标题

重复文本集合的计算MEMS和亲戚

Computing MEMs and Relatives on Repetitive Text Collections

论文作者

Navarro, Gonzalo

论文摘要

我们考虑了计算一个给定模式$ p [1 .. m] $的最大精确匹配(mem)的问题,该$ p [1 .. m] $ $ t [1 .. n] $,该$表示为(希望较小)的(希望较小的)运行长度上下文的语法$ g_ {rl} $。我们表明,对于任何常数$ε> 0 $,在大小$ o(g_ {rl})$的数据结构上,可以在时间$ o(m^2 \ log^εn)$中解决问题。此外,在大小$ o(δ\ log \ frac {n}δ)$的局部一致语法上,时间减少到$ o(m \ log m(\ log m + m + \ log^εn))$。值$δ$是$ t $和$ω(δ\ log \ frac {n}δ)$的基因复杂性的函数,这是重复文本$ t $的可压缩性的紧密下限,因此我们的结构在$ n $和$δ$方面具有最佳的大小。我们将结果扩展到几个相关的问题,例如查找$ K $ -MEMS,妈妈,稀有MEMS和应用程序。

We consider the problem of computing the Maximal Exact Matches (MEMs) of a given pattern $P[1 .. m]$ on a large repetitive text collection $T[1 .. n]$, which is represented as a (hopefully much smaller) run-length context-free grammar of size $g_{rl}$. We show that the problem can be solved in time $O(m^2 \log^εn)$, for any constant $ε> 0$, on a data structure of size $O(g_{rl})$. Further, on a locally consistent grammar of size $O(δ\log\frac{n}δ)$, the time decreases to $O(m\log m(\log m + \log^εn))$. The value $δ$ is a function of the substring complexity of $T$ and $Ω(δ\log\frac{n}δ)$ is a tight lower bound on the compressibility of repetitive texts $T$, so our structure has optimal size in terms of $n$ and $δ$. We extend our results to several related problems, such as finding $k$-MEMs, MUMs, rare MEMs, and applications.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源