Paper Title
HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein Language Model as an Alternative
Paper Authors
Paper Abstract
AI-based protein structure prediction pipelines, such as AlphaFold2, have achieved near-experimental accuracy. These advanced pipelines mainly rely on Multiple Sequence Alignments (MSAs) as inputs to learn co-evolution information from homologous sequences. However, searching protein databases for MSAs is time-consuming, usually taking dozens of minutes. Consequently, we attempt to explore the limits of fast protein structure prediction using only the primary sequences of proteins. We propose HelixFold-Single, which combines a large-scale protein language model with the superior geometric learning capability of AlphaFold2. HelixFold-Single first pre-trains a large-scale protein language model (PLM) on thousands of millions of primary sequences using the self-supervised learning paradigm; the PLM then serves as an alternative to MSAs for learning co-evolution information. Then, by combining the pre-trained PLM with the essential components of AlphaFold2, we obtain an end-to-end differentiable model that predicts the 3D coordinates of atoms from the primary sequence alone. HelixFold-Single is validated on the CASP14 and CAMEO datasets, achieving accuracy competitive with MSA-based methods on targets with large homologous families. Furthermore, HelixFold-Single consumes much less time than the mainstream pipelines for protein structure prediction, demonstrating its potential in tasks requiring many predictions. The code of HelixFold-Single is available at https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold-single, and we also provide stable web services at https://paddlehelix.baidu.com/app/drug/protein-single/forecast.