论文标题
线性折:线性时间近似RNA通过5'-to-3'动态编程和光束搜索折叠
LinearFold: linear-time approximate RNA folding by 5'-to-3' dynamic programming and beam search
论文作者
论文摘要
动机:预测RNA序列的二级结构在许多应用中都是有用的。现有的算法(基于动态编程)遭受了一个主要限制:它们的运行时间与RNA长度立方体尺度,并且这种缓慢限制了它们在全基因组应用中的使用。 结果:我们提出了一种新颖的替代$ O(n^3)$ - 时间动态动态编程算法,用于RNA折叠,这与启发式方法相吻合,使其以$ O(n)$时间和$ O(n)$空间运行,同时生成了最佳解决方案的高质量近似。受到计算语言学中无上下文语法的增量解析的启发,我们的替代动态编程算法扫描了从左到右(5'-to-3')方向而不是以自下而上的方式扫描序列,这使我们能够采用有效的光束修剪启发式。我们的工作虽然不精确,但它是第一个实现线性运行时(和线性空间)的RNA折叠算法,而不会对输出结构施加约束。令人惊讶的是,我们的近似搜索导致在具有已知结构的序列数据库上提高了更高的总体准确性。更有趣的是,它会导致对该数据库中最长的序列家族(16S和23S核糖体RNA)的更准确的预测,并改善了长期基本对(500多个核苷酸分开)的精确度,这两种对当前模型都很具有挑战性。 可用性:我们的源代码可在https://github.com/linearfold/linearfold上获得,我们的WebServer在http://linearfold.org(序列限制:100,000NT)。
Motivation: Predicting the secondary structure of an RNA sequence is useful in many applications. Existing algorithms (based on dynamic programming) suffer from a major limitation: their runtimes scale cubically with the RNA length, and this slowness limits their use in genome-wide applications. Results: We present a novel alternative $O(n^3)$-time dynamic programming algorithm for RNA folding that is amenable to heuristics that make it run in $O(n)$ time and $O(n)$ space, while producing a high-quality approximation to the optimal solution. Inspired by incremental parsing for context-free grammars in computational linguistics, our alternative dynamic programming algorithm scans the sequence in a left-to-right (5'-to-3') direction rather than in a bottom-up fashion, which allows us to employ the effective beam pruning heuristic. Our work, though inexact, is the first RNA folding algorithm to achieve linear runtime (and linear space) without imposing constraints on the output structure. Surprisingly, our approximate search results in even higher overall accuracy on a diverse database of sequences with known structures. More interestingly, it leads to significantly more accurate predictions on the longest sequence families in that database (16S and 23S Ribosomal RNAs), as well as improved accuracies for long-range base pairs (500+ nucleotides apart), both of which are well known to be challenging for the current models. Availability: Our source code is available at https://github.com/LinearFold/LinearFold, and our webserver is at http://linearfold.org (sequence limit: 100,000nt).