Paper Title

GEOM: Energy-annotated molecular conformations for property prediction and molecular generation

Authors

Simon Axelrod, Rafael Gomez-Bombarelli

Abstract

Machine learning (ML) outperforms traditional approaches in many molecular design tasks. ML models usually predict molecular properties from a 2D chemical graph or a single 3D structure, but neither of these representations accounts for the ensemble of 3D conformers that are accessible to a molecule. Property prediction could be improved by using conformer ensembles as input, but there is no large-scale dataset that contains graphs annotated with accurate conformers and experimental data. Here we use advanced sampling and semi-empirical density functional theory (DFT) to generate 37 million molecular conformations for over 450,000 molecules. The Geometric Ensemble Of Molecules (GEOM) dataset contains conformers for 133,000 species from QM9, and 317,000 species with experimental data related to biophysics, physiology, and physical chemistry. Ensembles of 1,511 species with BACE-1 inhibition data are also labeled with high-quality DFT free energies in an implicit water solvent, and 534 ensembles are further optimized with DFT. GEOM will assist in the development of models that predict properties from conformer ensembles, and generative models that sample 3D conformations.
