论文标题
局部保存最小的K-Mers的完美散列
Locality-Preserving Minimal Perfect Hashing of k-mers
论文作者
论文摘要
最小的完美哈希是将静态的$ n $不同键映射到地址空间$ \ {1,\ ldots,n \} $的问题。众所周知,在不使用其他有关输入键的知识时,必须使用$ n \ log_2(e)$位来指定最小的完美哈希功能(MPHF)$ f $。但是,实际上,输入键具有内在关系,我们可以利用这些关系来降低$ f $的位复杂性。例如,将一个字符串及其所有独特的$ k $ - mers的集合作为输入键:由于两个连续的$ k $ -mers共享$ k-1 $符号的重叠,因此在这种情况下,似乎可以击败经典的$ \ log_2(e)$ bits/键障碍。此外,我们希望$ f $绘制连续的$ k $ - 连续地址,以便尽可能保留其在代码域中的关系。这在实践中是一项有用的功能,因为它可以保证$ f $的一定程度的参考,从而在连续$ k $ -mers查询时会更好地评估时间。在这些前提下,我们启动了针对$ k $ mers设计的新型区域性MPHF的研究,从一系列字符串中连续提取。我们设计了一种结构,其空间使用量减少了$ k $,并通过该方法的实际实现讨论了实验:实际上,使用我们的方法构建的功能可能比文献中最有效的MPHF小几倍甚至更快地查询。
Minimal perfect hashing is the problem of mapping a static set of $n$ distinct keys into the address space $\{1,\ldots,n\}$ bijectively. It is well-known that $n\log_2(e)$ bits are necessary to specify a minimal perfect hash function (MPHF) $f$, when no additional knowledge of the input keys is to be used. However, it is often the case in practice that the input keys have intrinsic relationships that we can exploit to lower the bit complexity of $f$. For example, consider a string and the set of all its distinct $k$-mers as input keys: since two consecutive $k$-mers share an overlap of $k-1$ symbols, it seems possible to beat the classic $\log_2(e)$ bits/key barrier in this case. Moreover, we would like $f$ to map consecutive $k$-mers to consecutive addresses, as to also preserve as much as possible their relationship in the codomain. This is a useful feature in practice as it guarantees a certain degree of locality of reference for $f$, resulting in a better evaluation time when querying consecutive $k$-mers. Motivated by these premises, we initiate the study of a new type of locality-preserving MPHF designed for $k$-mers extracted consecutively from a collection of strings. We design a construction whose space usage decreases for growing $k$ and discuss experiments with a practical implementation of the method: in practice, the functions built with our method can be several times smaller and even faster to query than the most efficient MPHFs in the literature.