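DrugOOD: Out-of-Distribution (OOD) Dataset Curator and Benchmark for AI-aided Drug Discovery -- A Focus on Affinity Prediction Problems with Noise Annotations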

Paper Title

DrugOOD: Out-of-Distribution (OOD) Dataset Curator and Benchmark for AI-aided Drug Discovery -- A Focus on Affinity Prediction Problems with Noise Annotations

Authors

Yuanfeng Ji, Lu Zhang, Jiaxiang Wu, Bingzhe Wu, Long-Kai Huang, Tingyang Xu, Yu Rong, Lanqing Li, Jie Ren, Ding Xue, Houtim Lai, Shaoyong Xu, Jing Feng, Wei Liu, Ping Luo, Shuigeng Zhou, Junzhou Huang, Peilin Zhao, Yatao Bian

Abstract

AI-aided drug discovery (AIDD) is gaining popularity due to its promise of making the search for new pharmaceuticals quicker, cheaper, and more efficient. Despite its extensive use in many fields, such as ADMET prediction, virtual screening, protein folding, and generative chemistry, little has been explored in terms of the out-of-distribution (OOD) learning problem with noise, which is inevitable in real-world AIDD applications. In this work, we present DrugOOD, a systematic OOD dataset curator and benchmark for AI-aided drug discovery, which comes with an open-source Python package that fully automates the data curation and OOD benchmarking processes. We focus on one of the most crucial problems in AIDD: drug-target binding affinity prediction, which involves both macromolecules (protein targets) and small molecules (drug compounds). Rather than providing only fixed datasets, DrugOOD offers an automated dataset curator with user-friendly customization scripts, rich domain annotations aligned with biochemistry knowledge, realistic noise annotations, and rigorous benchmarking of state-of-the-art OOD algorithms. Since molecular data is often modeled as irregular graphs using graph neural network (GNN) backbones, DrugOOD also serves as a valuable testbed for graph OOD learning problems. Extensive empirical studies show a significant performance gap between in-distribution and out-of-distribution experiments, which highlights the need to develop better schemes that allow for OOD generalization under noise for AIDD.
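To make the idea of a domain-based OOD split concrete, here is a minimal Python sketch. It is not the DrugOOD package's API: the function name `split_by_domain`, the `scaffold` field, and the toy records are all hypothetical and chosen only for illustration. The sketch holds out whole domains for testing so that no test domain is seen during training, which is the kind of setup in which an in-distribution versus out-of-distribution gap can be measured.

```python
from collections import defaultdict


def split_by_domain(samples, domain_key, ood_fraction=0.2):
    """Hold out whole domains so that test domains never appear in training.

    Hypothetical helper for illustration; not part of any real package.
    """
    # Group samples by their domain annotation (e.g. a scaffold label).
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[domain_key]].append(sample)

    # Order domains from largest to smallest; break ties by name so the
    # split is deterministic.
    domains = sorted(groups, key=lambda d: (-len(groups[d]), str(d)))

    target = ood_fraction * len(samples)
    train, ood_test = [], []
    # Walk from the smallest domain upward, sending whole domains to the
    # OOD test set until it reaches the requested share of the data.
    for domain in reversed(domains):
        if len(ood_test) < target:
            ood_test.extend(groups[domain])
        else:
            train.extend(groups[domain])
    return train, ood_test


# Toy records with made-up domain labels (assumption: one label per sample).
labels = ["A"] * 4 + ["B"] * 3 + ["C"] * 2 + ["D"]
samples = [{"id": i, "scaffold": s} for i, s in enumerate(labels)]

train, ood_test = split_by_domain(samples, "scaffold", ood_fraction=0.3)
train_domains = {s["scaffold"] for s in train}
ood_domains = {s["scaffold"] for s in ood_test}
print(sorted(train_domains), sorted(ood_domains))
```

Sending the smallest domains to the test set is only one possible design choice; a real benchmark may order or sample domains differently, and would typically also carve an in-distribution validation and test set out of the training domains.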
