Paper Title
ProGen2: Exploring the Boundaries of Protein Language Models
Paper Authors
Paper Abstract
Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional finetuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. We release the ProGen2 models and code at https://github.com/salesforce/progen.
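The claim of fitness prediction "without additional finetuning" refers to zero-shot scoring: ranking candidate sequences by their log-likelihood under the autoregressive model. The sketch below illustrates that idea assuming a generic Hugging Face causal-LM interface; the checkpoint name "progen2-small" and the AutoTokenizer/AutoModelForCausalLM loading path are illustrative assumptions, not the official ProGen2 release API (see https://github.com/salesforce/progen for the actual models and loading code).

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_NAME = "progen2-small"  # hypothetical checkpoint id; real weights live in the linked repo

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    model.eval()

    def log_likelihood(sequence: str) -> float:
        # Tokenize the amino-acid string into model input ids.
        ids = tokenizer(sequence, return_tensors="pt").input_ids
        with torch.no_grad():
            # With labels=ids, a causal LM returns the mean next-token
            # cross-entropy; negate and rescale to a total log-probability.
            loss = model(input_ids=ids, labels=ids).loss
        return -loss.item() * (ids.shape[1] - 1)

    # Zero-shot fitness: rank a variant against the wild type by log-likelihood.
    wild_type = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    variant = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVA"
    for name, seq in [("wild_type", wild_type), ("variant", variant)]:
        print(name, log_likelihood(seq))

Because the model is trained on evolutionary sequence data, a higher (less negative) log-likelihood marks a sequence the model finds more plausible, which serves as the fitness proxy with no task-specific training required.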