论文标题
预期编辑距离的可计算界限和蒙特卡洛估计值
Computable Bounds and Monte Carlo Estimates of the Expected Edit Distance
论文作者
论文摘要
编辑距离是字符串之间差异的指标,在计算生物学,语音识别和机器学习中广泛应用。令$ e_k(n)$表示$ n $字符的随机编辑距离与大小$ k $的字母之间的平均编辑距离。对于$ k \ geq 2 $,这是一个开放的问题,如何有效地计算$α_{k}(n)= e_k(n)/n $的确切值以及$α_{k} = \ lim_ {n \ to \ infty}α_{k}α_{k}(k}(n)$,已知的限制。 本文表明,对于特定的$ q(n)=θ(\ sqrt {\ sqrt {\ log n / n})$,$α_k(n)-q(n)-q(n)\leqα_K\ leqleqα_K(n)$,这意味着$α_k$是可计算的。探索了$α_k(n)$的确切计算,导致算法在时间$ t = \ mathcal {o}(n^2k \ min(3^n,k^n))$中,这是一种复杂性,使其具有有限的实际使用。 根据McDiarmid的不平等,提出了对统计估计的分析,以表明如何以良好的准确性,高置信度和合理的计算时间来评估$α_k(n)$,以$ n $ n $的价值(最多可达25万美元)。相应地,以$α_k$获得的99.9 \%置信区间约为$ 10^{-2} $。 利用编辑脚本的组合参数来分析表征有效计算的下限$β_K^*$ to $α_k$,因此$ \ lim_ {k \ to \ infty}β_K^*= 1 $。通常,$β_K^* \leqα_K\ leq 1-1/k $;对于$ k $,比数十个大的$β_K^*$要比以宽度为1-1/k-β_k^*$生成良好的统计估计值要快得多。 本文中开发的技术对大多数先前发表的数值以及字母大小和弦长的结果的结果产生了改进,以前未报告。
The edit distance is a metric of dissimilarity between strings, widely applied in computational biology, speech recognition, and machine learning. Let $e_k(n)$ denote the average edit distance between random, independent strings of $n$ characters from an alphabet of size $k$. For $k \geq 2$, it is an open problem how to efficiently compute the exact value of $α_{k}(n) = e_k(n)/n$ as well as of $α_{k} = \lim_{n \to \infty} α_{k}(n)$, a limit known to exist. This paper shows that $α_k(n)-Q(n) \leq α_k \leq α_k(n)$, for a specific $Q(n)=Θ(\sqrt{\log n / n})$, a result which implies that $α_k$ is computable. The exact computation of $α_k(n)$ is explored, leading to an algorithm running in time $T=\mathcal{O}(n^2k\min(3^n,k^n))$, a complexity that makes it of limited practical use. An analysis of statistical estimates is proposed, based on McDiarmid's inequality, showing how $α_k(n)$ can be evaluated with good accuracy, high confidence level, and reasonable computation time, for values of $n$ say up to a quarter million. Correspondingly, 99.9\% confidence intervals of width approximately $10^{-2}$ are obtained for $α_k$. Combinatorial arguments on edit scripts are exploited to analytically characterize an efficiently computable lower bound $β_k^*$ to $α_k$, such that $ \lim_{k \to \infty} β_k^*=1$. In general, $β_k^* \leq α_k \leq 1-1/k$; for $k$ greater than a few dozens, computing $β_k^*$ is much faster than generating good statistical estimates with confidence intervals of width $1-1/k-β_k^*$. The techniques developed in the paper yield improvements on most previously published numerical values as well as results for alphabet sizes and string lengths not reported before.