Paper Title
Reprogramming Language Models for Molecular Representation Learning
Paper Authors
Paper Abstract
Recent advancements in transfer learning have made it a promising approach for domain adaptation via the transfer of learned representations. This is especially relevant when the alternate task has limited samples of well-defined, labeled data, which is common in the molecular data domain, making transfer learning an ideal approach for molecular learning tasks. While adversarial reprogramming has proven to be a successful method for repurposing neural networks for alternate tasks, most works consider source and alternate tasks within the same domain. In this work, we propose a new algorithm, Representation Reprogramming via Dictionary Learning (R2DL), for adversarially reprogramming pretrained language models for molecular learning tasks, motivated by leveraging the learned representations of massive state-of-the-art language models. The adversarial program learns a linear transformation between a dense source-model input space (language data) and a sparse target-model input space (e.g., chemical and biological molecule data) using a k-SVD solver to approximate a sparse representation of the encoded data via dictionary learning. R2DL matches the baseline established by state-of-the-art toxicity prediction models trained on domain-specific data and outperforms that baseline in a limited training-data setting, thereby establishing an avenue for domain-agnostic transfer learning for tasks on molecular data.
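To make the dictionary-learning step concrete, the following is a minimal sketch of a k-SVD loop (alternating sparse coding with orthogonal matching pursuit and rank-1 atom updates), written in Python with NumPy and scikit-learn. This is not the authors' R2DL implementation; the function name `ksvd` and parameters such as `n_atoms` and `sparsity` are illustrative assumptions.

```python
# Minimal k-SVD sketch (illustrative, not the authors' R2DL code).
# Approximates Y (features x samples) as D @ X, where X is column-sparse.
import numpy as np
from sklearn.linear_model import orthogonal_mp

def ksvd(Y, n_atoms, sparsity, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the dictionary with random unit-norm atoms.
    D = rng.standard_normal((Y.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        # Sparse-coding step: codes with at most `sparsity` nonzeros per sample.
        X = orthogonal_mp(D, Y, n_nonzero_coefs=sparsity)
        # Dictionary-update step: refine one atom at a time via a rank-1 SVD.
        for k in range(n_atoms):
            used = np.nonzero(X[k, :])[0]
            if used.size == 0:
                continue
            # Residual of the signals using atom k, with atom k's contribution removed.
            E = Y[:, used] - D @ X[:, used] + np.outer(D[:, k], X[k, used])
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]
            X[k, used] = s[0] * Vt[0, :]
    return D, X
```

Under this sketch, one would call, for example, `D, X = ksvd(embeddings, n_atoms=128, sparsity=5)` on a matrix of encoded token embeddings; the learned dictionary `D` then plays the role of the mapping between the dense source input space and the sparse target input space that the abstract describes.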