Tyger: Task-Type-Generic Active Learning for Molecular Property Prediction

Paper Title

Tyger: Task-Type-Generic Active Learning for Molecular Property Prediction

Authors

Kuangqi Zhou, Kaixin Wang, Jiashi Feng, Jian Tang, Tingyang Xu, Xinchao Wang

Abstract

How to accurately predict the properties of molecules is an essential problem in AI-driven drug discovery, which generally requires a large amount of annotation for training deep learning models. Annotating molecules, however, is quite costly because it requires lab experiments conducted by experts. To reduce annotation cost, deep Active Learning (AL) methods are developed to select only the most representative and informative data for annotating. However, existing best deep AL methods are mostly developed for a single type of learning task (e.g., single-label classification), and hence may not perform well in molecular property prediction that involves various task types. In this paper, we propose a Task-type-generic active learning framework (termed Tyger) that is able to handle different types of learning tasks in a unified manner. The key is to learn a chemically-meaningful embedding space and perform active selection fully based on the embeddings, instead of relying on task-type-specific heuristics (e.g., class-wise prediction probability) as done in existing works. Specifically, for learning the embedding space, we instantiate a querying module that learns to translate molecule graphs into corresponding SMILES strings. Furthermore, to ensure that samples selected from the space are both representative and informative, we propose to shape the embedding space by two learning objectives, one based on domain knowledge and the other leveraging feedback from the task learner (i.e., model that performs the learning task at hand). We conduct extensive experiments on benchmark datasets of different task types. Experimental results show that Tyger consistently achieves high AL performance on molecular property prediction, outperforming baselines by a large margin. We also perform ablative experiments to verify the effectiveness of each component in Tyger.
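To illustrate why selecting purely from embeddings does not depend on the task type, here is a minimal sketch of one well-known embedding-based selection rule, greedy k-center (farthest-point) selection. This is an assumption made for illustration: the abstract does not state which selection rule Tyger uses, and the function name `k_center_greedy` and its parameters are hypothetical. The code only needs a vector per molecule, so it would apply equally to classification and regression tasks.

```python
import math
from typing import List, Sequence


def k_center_greedy(
    embeddings: Sequence[Sequence[float]],
    labeled: Sequence[int],
    budget: int,
) -> List[int]:
    """Pick up to `budget` unlabeled indices by farthest-point selection.

    Illustrative stand-in only; not the selection rule from the paper.
    """
    n = len(embeddings)
    # Distance from each point to its nearest labeled (or already chosen) point.
    min_dist = [math.inf] * n
    for j in labeled:
        for i in range(n):
            min_dist[i] = min(min_dist[i], math.dist(embeddings[i], embeddings[j]))

    chosen: List[int] = []
    taken = set(labeled)
    while len(chosen) < budget and len(taken) < n:
        # Take the unlabeled point that is farthest from everything covered so far.
        candidates = [i for i in range(n) if i not in taken]
        pick = max(candidates, key=lambda i: min_dist[i])
        chosen.append(pick)
        taken.add(pick)
        # The new pick may now be the nearest covered point for some others.
        for i in range(n):
            min_dist[i] = min(min_dist[i], math.dist(embeddings[i], embeddings[pick]))
    return chosen


# Toy 2-D "embeddings": two tight clusters plus one outlier; index 0 is labeled.
toy = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]]
print(k_center_greedy(toy, labeled=[0], budget=2))  # [4, 2]
```

Note that nothing in the sketch reads class probabilities or regression outputs; the quality of the selection rests entirely on how well the embedding space reflects chemical similarity, which is the part the abstract says Tyger learns.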
