Paper Title
Controllable Protein Design with Language Models
Paper Authors
Abstract
The 21st century is presenting humankind with unprecedented environmental and medical challenges. The ability to design novel proteins tailored for specific purposes could transform our ability to respond to these issues in a timely manner. Recent advances in the field of artificial intelligence are now setting the stage to make this goal achievable. Protein sequences are inherently similar to natural languages: amino acids arrange in a multitude of combinations to form structures that carry function, the same way that letters form words and sentences that carry meaning. Therefore, it is not surprising that throughout the history of Natural Language Processing (NLP), many of its techniques have been applied to protein research problems. In the last few years, we have witnessed revolutionary breakthroughs in the field of NLP. The implementation of Transformer pre-trained models has enabled text generation with human-like capabilities, including texts with specific properties such as style or subject. Motivated by their considerable success in NLP tasks, we expect dedicated Transformers to dominate custom protein sequence generation in the near future. Fine-tuning pre-trained models on protein families will enable the extension of their repertoires with novel sequences that could be highly divergent but still potentially functional. The combination of control tags such as cellular compartment or function will further enable the controllable design of novel protein functions. Moreover, recent model interpretability methods will allow us to open the 'black box' and thus enhance our understanding of folding principles. While early initiatives show the enormous potential of generative language models to design functional sequences, the field is still in its infancy. We believe that protein language models are a promising and largely unexplored field and discuss their foreseeable impact on protein design.
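The control-tag conditioning described in the abstract can be illustrated with a minimal sketch. The idea, following CTRL-style conditional language modeling, is to prepend annotation tags (e.g. cellular compartment, function) to each amino-acid sequence during training, so that at generation time the model can be prompted with the desired tags. The tag names, formatting, and example sequence below are illustrative assumptions, not taken from any specific published protein model.

```python
# Sketch of CTRL-style control-tag conditioning for protein sequences.
# Tag vocabulary and format are hypothetical, for illustration only.

def format_training_example(sequence: str, tags: dict) -> str:
    """Prepend control tags to an amino-acid sequence so a language
    model can learn to condition generation on them."""
    prefix = " ".join(f"<{key}={value}>" for key, value in sorted(tags.items()))
    return f"{prefix} {sequence}"

# Toy amino-acid sequence with hypothetical annotations.
example = format_training_example(
    "MKTAYIAKQR",
    {"compartment": "membrane", "function": "transport"},
)
print(example)
# <compartment=membrane> <function=transport> MKTAYIAKQR
```

At inference, one would feed only the tag prefix (e.g. `<compartment=membrane> <function=transport>`) as the prompt and let the fine-tuned model complete the amino-acid sequence, steering generation toward the requested properties.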