Paper title
ASAP-SML: An Antibody Sequence Analysis Pipeline Using Statistical Testing and Machine Learning
Paper authors
Paper abstract
Antibodies are capable of potently and specifically binding individual antigens and, in some cases, disrupting their functions. The key challenge in generating antibody-based inhibitors is the lack of fundamental information relating the sequences of antibodies to their unique properties as inhibitors. We develop a pipeline, Antibody Sequence Analysis Pipeline using Statistical testing and Machine Learning (ASAP-SML), to identify features that distinguish one set of antibody sequences from the antibody sequences in a reference set. The pipeline extracts feature fingerprints from sequences. The fingerprints represent germline, CDR canonical structure, isoelectric point, and frequent positional motifs. Machine learning and statistical significance testing techniques are applied to the antibody sequences and extracted feature fingerprints to identify distinguishing feature values and combinations thereof. To demonstrate how it works, we applied the pipeline to sets of antibody sequences known to bind or inhibit the activities of matrix metalloproteinases (MMPs), a family of zinc-dependent enzymes that promote cancer progression and undesired inflammation under pathological conditions, and compared them against reference datasets of antibody sequences that do not bind or inhibit MMPs. ASAP-SML identifies features and combinations of feature values found in the MMP-targeting sets that are distinct from those in the reference sets.
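The idea of comparing feature fingerprints between a target set and a reference set can be illustrated with a small sketch. This is not the ASAP-SML implementation. It assumes a simplified fingerprint made only of (position, residue) pairs, a one-sided Fisher's exact test as the significance test (the abstract does not name the test), and made-up toy sequences. The function names `positional_features`, `fisher_enrichment_p`, and `distinguishing_features` are hypothetical.

```python
from math import comb


def positional_features(seq):
    """Return the set of (position, residue) pairs present in a sequence.

    Assumption: a stand-in for a real fingerprint, which would also cover
    germline, CDR canonical structure, and isoelectric point.
    """
    return {(i, aa) for i, aa in enumerate(seq)}


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher's exact test p-value for enrichment in the target set.

    a = target with feature,    b = target without feature,
    c = reference with feature, d = reference without feature.
    """
    n_target = a + b
    n_feature = a + c
    n_total = a + b + c + d
    # Sum hypergeometric probabilities of tables at least as extreme as observed.
    upper = min(n_target, n_feature)
    numerator = sum(
        comb(n_feature, k) * comb(n_total - n_feature, n_target - k)
        for k in range(a, upper + 1)
    )
    return numerator / comb(n_total, n_target)


def distinguishing_features(target, reference, alpha=0.05):
    """List features significantly enriched in target relative to reference."""
    target_fp = [positional_features(s) for s in target]
    reference_fp = [positional_features(s) for s in reference]
    hits = []
    for feat in sorted(set().union(*target_fp)):
        a = sum(feat in fp for fp in target_fp)
        c = sum(feat in fp for fp in reference_fp)
        p = fisher_enrichment_p(a, len(target_fp) - a, c, len(reference_fp) - c)
        if p < alpha:
            hits.append((feat, p))
    return hits


# Toy data (hypothetical, not real antibody sequences).
target = ["ARDYW", "ARDYF", "ARDYW", "ARDYG", "ARDYW", "ARDYW"]
reference = ["ARGSW", "AKGSF", "ARGTW", "ATGSG", "ARGSW", "AKGSW"]

hits = distinguishing_features(target, reference)
for (position, residue), p_value in hits:
    print(position, residue, p_value)
```

With many candidate features, a real analysis would also need a multiple-testing correction, which this sketch leaves out for brevity.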