Paper title
Fast genomic optical map assembly algorithm using binary representation
Paper authors
Paper abstract
The reduction in genome sequencing costs brought by next-generation sequencing technologies has greatly increased the number of genomic projects. As a result, there is a growing need for better assembly and assembly validation methods. One promising idea is to use heterogeneous data in assembly projects. Optical Mapping (OM) is beneficial for validating, correcting, and scaffolding genomic assemblies. A single raw OM read describes a long fragment of a DNA molecule, up to 1 Mbp. Raw OM data from the same genome can be assembled to create consensus maps that span an entire chromosome. The assembly process is computationally hard because of the large number of errors in the input data. This work describes a new algorithm and computer program for assembling OM reads without a reference genome. In our algorithm, we explored a binary representation for genome maps. We focused on the efficiency of data structures and algorithms and on scaling on parallel platforms. The algorithm consists of several steps, of which the most important are: (1) conversion of the restriction maps into binary strings, (2) detection of overlaps between restriction maps, (3) determination of the layout of the set of restriction maps, and (4) creation of consensus genomic maps. Our algorithm handles optical mapping data with low error levels but fails on reads with high error levels. We developed a software library, a console application, and a module for the Python language. The approach presented in this paper proved to be faster than a dynamic programming approach and performed well on error-free data. It could be used as a step in \textit{de~novo} assembly pipelines or to detect misassemblies. The software is freely available in a public repository under the GNU LGPL v3 license (https://sourceforge.net/p/binary-genome-maps/code).
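Steps (1) and (2) can be illustrated with a small sketch. The abstract does not specify the encoding or the overlap score, so everything below is an assumption made for illustration, not the published method or the API of the released library: a read is given as restriction-site positions in base pairs, each bin of `bin_size` base pairs becomes one bit, and two reads are compared by sliding one bit string against the other and counting shared set bits. The names `map_to_bits` and `best_overlap` are hypothetical.

```python
def map_to_bits(site_positions, bin_size):
    """Encode restriction-site positions (in bp) as an integer bit set.

    Assumed encoding: bit i is set when at least one site falls in
    the bin [i * bin_size, (i + 1) * bin_size).
    """
    bits = 0
    for pos in site_positions:
        bits |= 1 << (pos // bin_size)
    return bits


def best_overlap(a, b, max_shift):
    """Slide bit set b against bit set a and return (shift, score).

    A shift s aligns bin 0 of b with bin s of a. The score is the
    number of bins in which both maps have a site (popcount of AND).
    Ties are resolved in favour of the smallest shift.
    """
    best_shift, best_score = 0, -1
    for shift in range(-max_shift, max_shift + 1):
        # Negative shifts drop the bins of b that fall before the start of a.
        shifted = b << shift if shift >= 0 else b >> -shift
        score = bin(a & shifted).count("1")
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift, best_score


# Two error-free toy reads from the same region; read_b starts 9 kbp after read_a.
read_a = map_to_bits([1000, 4000, 9000, 15000, 22000], bin_size=1000)
read_b = map_to_bits([0, 6000, 13000, 21000], bin_size=1000)
shift, score = best_overlap(read_a, read_b, max_shift=30)
print(shift, score)
```

Because each comparison reduces to a shift, a bitwise AND, and a population count, a scheme of this kind avoids filling a dynamic programming table for every pair of reads. It is also sensitive to sizing errors that move a site into a neighbouring bin, which is in line with the abstract's remark that the approach works on low-error data but fails on reads with high error levels.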