Paper Title
Exploring evolution-aware & -free protein language models as protein function predictors
Paper Authors
Paper Abstract
Large-scale Protein Language Models (PLMs) have improved performance in protein prediction tasks, ranging from 3D structure prediction to various function predictions. In particular, AlphaFold, a ground-breaking AI system, could potentially reshape structural biology. However, the utility of the PLM module in AlphaFold, Evoformer, has not been explored beyond structure prediction. In this paper, we investigate the representation ability of three popular PLMs: ESM-1b (single sequence), MSA-Transformer (multiple sequence alignment), and Evoformer (structure), with a special focus on Evoformer. Specifically, we aim to answer the following key questions: (i) Does the Evoformer trained as part of AlphaFold produce representations amenable to predicting protein function? (ii) If yes, can Evoformer replace ESM-1b and MSA-Transformer? (iii) How much do these PLMs rely on evolution-related protein data, and in this regard, are they complementary to each other? We compare these models through an empirical study and present new insights and conclusions. All code and datasets for reproducibility are available at https://github.com/elttaes/Revisiting-PLMs.
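For readers who want a concrete starting point, the sketch below shows one common way to probe a single-sequence PLM representation for function prediction: extract per-residue embeddings from a pretrained ESM-1b model, mean-pool them into a fixed-size protein embedding, and attach a linear classifier head. This is an illustrative sketch, not the paper's own pipeline; it assumes the fair-esm package (pip install fair-esm) and PyTorch, and the sequence, class count, and probe head are placeholders.

```python
# Minimal sketch: probing ESM-1b embeddings for protein function prediction.
# Assumes fair-esm (pip install fair-esm) and PyTorch are installed.
import torch
import esm

# Load the pretrained ESM-1b model and its alphabet (tokenizer).
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

# Placeholder input: one (label, sequence) pair.
data = [("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, batch_tokens = batch_converter(data)

# Extract representations from the final (33rd) transformer layer.
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33])
token_representations = results["representations"][33]

# Mean-pool over residue positions (skipping the BOS token) to obtain a
# fixed-size per-protein embedding.
seq_len = len(data[0][1])
protein_embedding = token_representations[0, 1 : seq_len + 1].mean(dim=0)

# Hypothetical downstream head: a linear probe over the pooled embedding.
num_classes = 10  # placeholder, e.g., number of function labels
probe = torch.nn.Linear(protein_embedding.shape[-1], num_classes)
logits = probe(protein_embedding)
```

The same probing recipe carries over to MSA-Transformer and Evoformer by swapping in their respective inputs (an MSA, or AlphaFold's Evoformer activations) while keeping the pooled-embedding-plus-head design fixed, which is what makes the representations directly comparable.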