Diversifying Design of Nucleic Acid Aptamers Using Unsupervised Machine Learning

Paper Title

Diversifying Design of Nucleic Acid Aptamers Using Unsupervised Machine Learning

Authors

Siba Moussa, Michael Kilgour, Clara Jans, Alex Hernandez-Garcia, Miroslava Cuperlovic-Culf, Yoshua Bengio, Lena Simine

Abstract

Inverse design of short single-stranded RNA and DNA sequences (aptamers) is the task of finding sequences that satisfy a set of desired criteria. Relevant criteria may be, for example, the presence of specific folding motifs, binding to molecular ligands, sensing properties, etc. Most practical approaches to aptamer design identify a small set of promising candidate sequences using high-throughput experiments (e.g. SELEX), and then optimize performance by introducing only minor modifications to the empirically found candidates. Sequences that possess the desired properties but differ drastically in chemical composition will add diversity to the search space and facilitate the discovery of useful nucleic acid aptamers. Systematic diversification protocols are needed. Here we propose to use an unsupervised machine learning model known as the Potts model to discover new, useful sequences with controllable sequence diversity. We start by training a Potts model using the maximum entropy principle on a small set of empirically identified sequences unified by a common feature. To generate new candidate sequences with a controllable degree of diversity, we take advantage of the model's spectral feature: an energy bandgap separating sequences that are similar to the training set from those that are distinct. By controlling the Potts energy range that is sampled, we generate sequences that are distinct from the training set yet still likely to have the encoded features. To demonstrate performance, we apply our approach to design diverse pools of sequences with specified secondary structure motifs in 30-mer RNA and DNA aptamers.
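
The idea of sampling sequences within a chosen Potts energy range can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the sign convention of the energy, the use of Metropolis Monte Carlo as the sampler, and all names (`random_potts_params`, `potts_energy`, `sample_in_energy_window`) are assumptions made for this sketch. The parameters here are random placeholders, whereas in the paper they would be fitted to empirically identified sequences via the maximum entropy principle.

```python
import math
import random

ALPHABET = "ACGU"  # RNA alphabet; use "ACGT" for DNA


def random_potts_params(length, rng):
    """Placeholder fields h and couplings J (assumption: NOT trained on data)."""
    q = len(ALPHABET)
    h = [[rng.gauss(0.0, 1.0) for _ in range(q)] for _ in range(length)]
    J = {
        (i, j): [[rng.gauss(0.0, 0.1) for _ in range(q)] for _ in range(q)]
        for i in range(length)
        for j in range(i + 1, length)
    }
    return h, J


def potts_energy(seq, h, J):
    """Assumed convention: E(s) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j)."""
    idx = [ALPHABET.index(c) for c in seq]
    energy = -sum(h[i][a] for i, a in enumerate(idx))
    for (i, j), coupling in J.items():
        energy -= coupling[idx[i]][idx[j]]
    return energy


def sample_in_energy_window(h, J, e_min, e_max, n_steps, rng, temperature=1.0):
    """Metropolis walk over sequences; record visited states with e_min <= E <= e_max."""
    length = len(h)
    seq = [rng.choice(ALPHABET) for _ in range(length)]
    energy = potts_energy(seq, h, J)
    hits = []
    for _ in range(n_steps):
        pos = rng.randrange(length)
        old = seq[pos]
        seq[pos] = rng.choice(ALPHABET)  # propose a single-site mutation
        new_energy = potts_energy(seq, h, J)
        delta = new_energy - energy
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy  # accept the move
        else:
            seq[pos] = old  # reject the move and restore the previous base
        if e_min <= energy <= e_max:
            hits.append(("".join(seq), energy))
    return hits
```

Under these assumptions, a call such as `sample_in_energy_window(h, J, e_min, e_max, 10000, random.Random(0))` returns the visited 30-mers whose energy falls in the chosen window. With a trained model, placing that window on the far side of the energy gap from the training set would correspond to the diversification step described in the abstract; the window bounds themselves would have to be read off the trained model's energy spectrum.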
