Protein Language Models and Structure Prediction: Connection and Progression

Paper Title

Protein Language Models and Structure Prediction: Connection and Progression

Authors

Bozhen Hu, Jun Xia, Jiangbin Zheng, Cheng Tan, Yufei Huang, Yongjie Xu, Stan Z. Li

Abstract

Predicting protein structures from sequences is an important task for function prediction, drug design, and understanding related biological processes. Recent advances have demonstrated the power of language models (LMs) in processing protein sequence databases: LMs inherit the advantages of attention networks and capture useful information when learning representations of proteins. The past two years have witnessed remarkable success in tertiary protein structure prediction (PSP), including both evolution-based and single-sequence-based PSP. Protein language model (pLM)-based pipelines, rather than energy-based models and sampling procedures, appear to have emerged as the mainstream paradigm in PSP. Despite this progress, the PSP community still needs a systematic and up-to-date survey that helps bridge the gap between LMs in natural language processing (NLP) and the PSP domain, and that introduces their methodologies, advances, and practical applications. To this end, we first introduce the similarities between protein and human languages that allow LMs to be extended to pLMs and applied to protein databases. We then systematically review recent advances in LMs and pLMs from the perspectives of network architectures, pre-training strategies, applications, and commonly used protein databases. Next, we discuss different types of PSP methods, particularly how pLM-based architectures function in the process of protein folding. Finally, we identify challenges faced by the PSP community and foresee promising research directions that accompany the advances of pLMs. This survey aims to be a hands-on guide that helps researchers understand PSP methods, develop pLMs, and tackle challenging problems in this field for practical purposes.
