Paper title

ProTranslator: zero-shot protein function prediction using textual description

Authors

Hanwen Xu, Sheng Wang

Abstract

Accurately finding proteins and genes that have a certain function is a prerequisite for a broad range of biomedical applications. Despite the encouraging progress of existing computational approaches to protein function prediction, it remains challenging to annotate proteins to a novel function that is not collected in the Gene Ontology and does not have any annotated proteins. This limitation, a side effect of the widely used multi-label classification setting for protein function prediction, hampers the study of new pathways and biological processes, and further slows down research in various biomedical areas. Here, we tackle this problem by annotating proteins to a function based only on its textual description, so that we do not need to know any proteins associated with this function. The key idea of our method, ProTranslator, is to redefine protein function prediction as a machine translation problem, which translates the description word sequence of a function to the amino acid sequence of a protein. We can then transfer annotations from functions that have similar textual descriptions to annotate a novel function. We observed substantial improvement in annotating novel functions and sparsely annotated functions on the CAFA3, SwissProt and GOA datasets. We further demonstrated how our method accurately predicted gene members for a given pathway in Reactome, KEGG and MSigDB based only on the pathway description. Finally, we showed how ProTranslator enabled us to generate a textual description, instead of a function label, for a set of proteins, providing a new scheme for protein function prediction. We envision that ProTranslator will give rise to a protein function "search engine" that returns a list of proteins based on the free text queried by the user.
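To make the annotation-transfer idea concrete, here is a toy sketch in Python. It is not the ProTranslator model: the abstract does not specify the encoders or the scoring function, so this example substitutes a simple bag-of-words cosine similarity for a learned text encoder. The function names (`bag_of_words`, `cosine`, `rank_proteins`), the protein IDs and the descriptions are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; an assumed stand-in for a learned text encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_proteins(query, known_functions):
    """Score proteins for an unseen function description.

    known_functions maps a function description to the set of protein IDs
    already annotated with it. Each protein accumulates the similarity
    between the query and every known function it is annotated with.
    """
    q = bag_of_words(query)
    scores = Counter()
    for description, proteins in known_functions.items():
        sim = cosine(q, bag_of_words(description))
        for protein in proteins:
            scores[protein] += sim
    # Highest score first; ties broken by protein ID for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy data: two annotated functions, one novel query.
known = {
    "dna repair of double strand breaks": {"P1", "P2"},
    "glucose transport across membrane": {"P3"},
}
ranking = rank_proteins("repair of damaged dna", known)
for protein, score in ranking:
    print(protein, round(score, 3))
```

The query shares words only with the DNA repair description, so the proteins annotated with that function rank above the glucose transporter. A real system would replace the word-count vectors with learned embeddings of both text and amino acid sequences, which is the part this sketch deliberately leaves out.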
