Paper title
Accurate Molecular-Orbital-Based Machine Learning Energies via Unsupervised Clustering of Chemical Space
Paper authors
Abstract
We introduce an unsupervised clustering algorithm to improve training efficiency and accuracy in predicting energies using molecular-orbital-based machine learning (MOB-ML). This work determines clusters via the Gaussian mixture model (GMM) in an entirely automatic manner and simplifies an earlier supervised clustering approach [J. Chem. Theory Comput., 15, 6668 (2019)] by eliminating both the need for user-specified parameters and the training of an additional classifier. Unsupervised clustering results from GMM have the advantages of accurately reproducing chemically intuitive groupings of frontier molecular orbitals and of improving in performance as the number of training examples increases. The clusters resulting from supervised or unsupervised clustering are further combined with scalable Gaussian process regression (GPR) or linear regression (LR) to learn molecular energies accurately by generating a local regression model in each cluster. Among all four combinations of regressors and clustering methods, GMM combined with scalable exact Gaussian process regression (GMM/GPR) is the most efficient training protocol for MOB-ML. Numerical tests of molecular energy learning on thermalized datasets of drug-like molecules demonstrate the improved accuracy, transferability, and learning efficiency of GMM/GPR over the other training protocols for MOB-ML, i.e., supervised regression clustering combined with GPR (RC/GPR) and GPR without clustering. GMM/GPR also provides the best molecular energy predictions compared with those from the literature on the same benchmark datasets. With lower scaling, GMM/GPR has a 10.4-fold speedup in wall-clock training time compared with scalable exact GPR at a training size of 6500 QM7b-T molecules.
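The cluster-then-regress idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a one-dimensional synthetic feature instead of MOB features, fits the Gaussian mixture with a plain EM loop, and uses ordinary least squares as the local regressor, so it corresponds loosely to the GMM/LR combination and not to GMM/GPR. All function names (`fit_gmm_1d`, `assign`, `fit_line`, `train`, `predict`) and the data are hypothetical.

```python
import math
import random


def responsibilities(x, weights, means, variances):
    """Posterior probability of each mixture component for one point x."""
    dens = [
        w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
        for w, m, v in zip(weights, means, variances)
    ]
    total = sum(dens)
    return [d / total for d in dens]


def fit_gmm_1d(xs, k, iters=50):
    """Fit a 1-D Gaussian mixture by expectation-maximization (EM)."""
    lo, hi = min(xs), max(xs)
    # Deterministic start: means spread evenly over the data range.
    means = [lo + (j + 0.5) * (hi - lo) / k for j in range(k)]
    variances = [((hi - lo) / k) ** 2] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: soft assignment of every point to every component.
        resp = [responsibilities(x, weights, means, variances) for x in xs]
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps the density finite
    return weights, means, variances


def assign(x, gmm):
    """Hard cluster label: the component with the largest responsibility."""
    resp = responsibilities(x, *gmm)
    return max(range(len(resp)), key=resp.__getitem__)


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def train(xs, ys, k):
    """Cluster the features, then fit one local regressor per cluster."""
    gmm = fit_gmm_1d(xs, k)
    labels = [assign(x, gmm) for x in xs]
    models = []
    for j in range(k):
        cx = [x for x, lab in zip(xs, labels) if lab == j]
        cy = [y for y, lab in zip(ys, labels) if lab == j]
        models.append(fit_line(cx, cy))
    return gmm, models


def predict(x, gmm, models):
    """Route x to its cluster and evaluate that cluster's local model."""
    a, b = models[assign(x, gmm)]
    return a * x + b


# Synthetic data (an assumption): two feature clusters, each with its own linear law.
random.seed(0)
xs_a = [random.gauss(0.0, 0.5) for _ in range(100)]
xs_b = [random.gauss(5.0, 0.5) for _ in range(100)]
xs = xs_a + xs_b
ys = [2.0 * x + 1.0 for x in xs_a] + [-3.0 * x + 20.0 for x in xs_b]

gmm, models = train(xs, ys, k=2)
print(predict(0.3, gmm, models), predict(5.2, gmm, models))
```

Because each cluster in this toy data follows its own noise-free linear law, the two local fits recover those laws, which a single global line could not do. Swapping `fit_line` for a Gaussian process regressor would move the sketch closer to the GMM/GPR protocol named in the abstract.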