Paper Title

Reprogramming Pretrained Language Models for Antibody Sequence Infilling

Authors

Igor Melnyk, Vijil Chenthamarakshan, Pin-Yu Chen, Payel Das, Amit Dhurandhar, Inkit Padhi, Devleena Das

Abstract

Antibodies comprise the most versatile class of binding molecules, with numerous applications in biomedicine. Computational design of antibodies involves generating novel and diverse sequences while maintaining structural consistency. Designing the complementarity-determining region (CDR), which determines antigen binding affinity and specificity, poses challenges unique to antibodies. Recent deep learning models have shown impressive results; however, the limited number of known antibody sequence/structure pairs frequently leads to degraded performance, particularly a lack of diversity in the generated sequences. In our work, we address this challenge by leveraging Model Reprogramming (MR), which repurposes a model pretrained on a source language for tasks that are in a different language and have scarce data, where it may be difficult to train a high-performing model from scratch or to effectively fine-tune an existing pretrained model on the specific task. Specifically, we introduce ReprogBert, in which a pretrained English language model is repurposed for protein sequence infilling, thus performing cross-language adaptation with less data. Results on antibody design benchmarks show that our model, trained on a low-resource antibody sequence dataset, provides highly diverse CDR sequences, with up to more than a two-fold increase in diversity over the baselines, without losing structural integrity and naturalness. The generated sequences also demonstrate enhanced antigen binding specificity and virus neutralization ability. Code is available at https://github.com/IBM/ReprogBERT
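
The abstract does not spell out how the English model is repurposed, so the following is only a minimal sketch of one common way to set up model reprogramming: two trainable linear maps between the protein vocabulary and the English vocabulary, wrapped around a frozen model. The names `theta`, `gamma`, `frozen_lm_head`, and the toy sizes are assumptions made for illustration, the frozen language model is replaced by a trivial stand-in, and the weights are random rather than trained, so the infilled residues carry no biological meaning.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PROTEIN_VOCAB = list(AMINO_ACIDS) + ["[MASK]"]  # target (protein) vocabulary
SOURCE_VOCAB_SIZE = 50  # toy stand-in for the English vocabulary size
HIDDEN = 8              # toy embedding width

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Frozen: English token embeddings (source vocab x hidden).
english_embeddings = rand_matrix(SOURCE_VOCAB_SIZE, HIDDEN)
# Trainable in a real setup (hypothetical name): protein token -> mix of English tokens.
theta = rand_matrix(len(PROTEIN_VOCAB), SOURCE_VOCAB_SIZE)
# Trainable in a real setup (hypothetical name): English logits -> protein logits.
gamma = rand_matrix(SOURCE_VOCAB_SIZE, len(PROTEIN_VOCAB))

def embed_protein(tokens):
    # Each protein token embedding is a linear combination of English embeddings.
    protein_embeddings = matmul(theta, english_embeddings)
    return [protein_embeddings[PROTEIN_VOCAB.index(t)] for t in tokens]

def frozen_lm_head(hidden_states):
    # Stand-in for the frozen pretrained model: scores over the English vocab.
    transposed = [list(col) for col in zip(*english_embeddings)]
    return matmul(hidden_states, transposed)

def infill_logits(tokens):
    english_logits = frozen_lm_head(embed_protein(tokens))
    return matmul(english_logits, gamma)  # one row of protein logits per position

def infill(tokens):
    # Replace each [MASK] with the highest-scoring amino acid; keep other tokens.
    out = []
    for tok, row in zip(tokens, infill_logits(tokens)):
        if tok == "[MASK]":
            best = max(range(len(AMINO_ACIDS)), key=lambda i: row[i])
            out.append(AMINO_ACIDS[best])
        else:
            out.append(tok)
    return "".join(out)

# Toy sequence: three masked positions flanked by fixed residues.
masked = list("ARD") + ["[MASK]"] * 3 + list("WGQ")
filled = infill(masked)
```

In this setup only `theta` and `gamma` would be updated during training while the pretrained weights stay fixed, which is what makes the approach attractive when few antibody sequences are available.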
