Paper title
Genome Compression Against a Reference
Paper authors
Paper abstract
Storing and transmitting human genome sequences is an important part of genomic research and industrial applications. The complete human genome has 3.1 billion base pairs (haploid), and storing the entire genome naively takes about 3 GB, which is infeasible for large-scale use. However, human genomes are highly redundant: any given individual's genome differs from another individual's genome by less than 1%. Tools such as DNAZip exploit this by expressing a given genome sequence as only the differences between that sequence and a reference genome sequence, which allows the genome to be compressed losslessly to ~4 MB. In this work, we demonstrate further improvements on top of the DNAZip library, achieving an additional ~11% compression beyond DNAZip's already impressive results. This allows further savings in disk space and in the network cost of transmitting human genome sequences.
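The idea of recording only the differences from a reference can be illustrated with a minimal Python sketch. This is not DNAZip's actual format or API: the function names `diff_against_reference` and `reconstruct` are hypothetical, the toy sequences are made up, and the sketch assumes the two sequences have equal length, so it handles substitutions only and ignores insertions and deletions.

```python
# Illustrative sketch only: hypothetical helper names, substitutions only.
# Real tools also handle insertions and deletions and use far more compact
# encodings than a Python list of tuples.

def diff_against_reference(reference: str, sample: str) -> list[tuple[int, str]]:
    """Return (position, base) for each position where sample differs from reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch assumes equal-length sequences")
    return [
        (pos, sample_base)
        for pos, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


def reconstruct(reference: str, diffs: list[tuple[int, str]]) -> str:
    """Rebuild the sample losslessly by applying the recorded differences."""
    bases = list(reference)
    for pos, base in diffs:
        bases[pos] = base
    return "".join(bases)


reference = "ACGTACGTACGT"  # toy reference sequence (assumption)
sample = "ACGTTCGTACGA"     # toy individual sequence (assumption)

diffs = diff_against_reference(reference, sample)
# Only the differing positions are stored, yet the sample is fully recoverable.
assert reconstruct(reference, diffs) == sample
```

Because the reference is shared by sender and receiver, only `diffs` needs to be stored or transmitted. When two genomes differ at fewer than 1% of positions, this list is far smaller than the full sequence.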