Paper Title

Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions

Authors

Jiayang Chen, Zhihang Hu, Siqi Sun, Qingxiong Tan, Yixuan Wang, Qinze Yu, Licheng Zong, Liang Hong, Jin Xiao, Tao Shen, Irwin King, Yu Li

Abstract

Non-coding RNA structure and function are essential to understanding various biological processes, such as cell signaling, gene expression, and post-transcriptional regulation. These are among the core problems in the RNA field. With the rapid growth of sequencing technology, we have accumulated a massive amount of unannotated RNA sequences. On the other hand, costly experimental observation yields only a limited amount of annotated data and 3D structures. Hence, it is still challenging to design computational methods for predicting RNA structures and functions, and the lack of annotated data and systematic study leads to inferior performance. To resolve this issue, we propose a novel RNA foundation model (RNA-FM) that takes advantage of all 23 million non-coding RNA sequences through self-supervised learning. With this approach, we find that the pre-trained RNA-FM can infer sequential and evolutionary information of non-coding RNAs without using any labels. Furthermore, we demonstrate RNA-FM's effectiveness by applying it to downstream secondary/3D structure prediction, SARS-CoV-2 genome structure and evolution prediction, protein-RNA binding preference modeling, and gene expression regulation modeling. Comprehensive experiments show that the proposed method improves RNA structural and functional modeling results significantly and consistently. Despite being trained only on unlabeled data, RNA-FM can serve as a foundation model for the field.
