论文标题
误差校正代码,用于简短串联重复和替换错误
Error-correcting Codes for Short Tandem Duplication and Substitution Errors
论文作者
论文摘要
由于其高数据密度和寿命,DNA被认为是满足越来越多的数据存储需求的有希望的媒介。但是,在DNA序列中发生的错误的多样性使有效的错误纠正成为一个具有挑战性的任务。本文旨在解决同时纠正两种类型的错误,即简短的串联重复和替换错误。我们专注于最多3个长度的串联重复,设计代码是为了纠正任意数量的重复错误和一个替换错误。因为可以多次复制替换的符号(作为各个长度的子字符串的一部分),所以单个替换会影响检索到的单词的无限基因。但是,我们表明,通过适当的预处理,效果可能仅限于有限长度的子字符串,从而使有效的错误纠正成为可能。我们构建一个代码,用于纠正上述错误并为其速率提供下限。与仅纠正重复误差的最佳代码相比,数值结果表明,当字母的尺寸为4时,保护额外替换的渐近成本仅为0.003位/符号,这是与DNA中数据存储相对应的重要情况。
Due to its high data density and longevity, DNA is considered a promising medium for satisfying ever-increasing data storage needs. However, the diversity of errors that occur in DNA sequences makes efficient error-correction a challenging task. This paper aims to address simultaneously correcting two types of errors, namely, short tandem duplication and substitution errors. We focus on tandem repeats of length at most 3 and design codes for correcting an arbitrary number of duplication errors and one substitution error. Because a substituted symbol can be duplicated many times (as part of substrings of various lengths), a single substitution can affect an unbounded substring of the retrieved word. However, we show that with appropriate preprocessing, the effect may be limited to a substring of finite length, thus making efficient error-correction possible. We construct a code for correcting the aforementioned errors and provide lower bounds for its rate. Compared to optimal codes correcting only duplication errors, numerical results show that the asymptotic cost of protecting against an additional substitution is only 0.003 bits/symbol when the alphabet has size 4, an important case corresponding to data storage in DNA.