Paper Title
Evaluating COVID-19 Sequence Data Using Nearest-Neighbors Based Network Model
Paper Authors
Paper Abstract
The SARS-CoV-2 coronavirus is the cause of COVID-19 in humans. Like many coronaviruses, it can adapt to different hosts and evolve into different lineages. The major SARS-CoV-2 lineages are known to be characterized by mutations that occur predominantly in the spike protein. Understanding the spike protein structure and how it can be perturbed is vital for determining whether a lineage is of concern. This understanding is crucial to identifying and controlling current outbreaks and preventing future pandemics. Given the volume of available sequencing data, much of which is unaligned or even unassembled, machine learning (ML) methods are a viable solution to this effort. However, such ML methods require fixed-length numerical feature vectors in Euclidean space to be applicable. Moreover, Euclidean space is not considered the best choice for classification and clustering tasks on biological sequences. For this purpose, we design a method that converts protein (spike) sequences into a sequence similarity network (SSN). The SSN can then serve as input to classical graph mining algorithms for typical tasks such as classification and clustering to understand the data. We show that the proposed alignment-free method outperforms the current state-of-the-art (SOTA) method in terms of clustering results. Similarly, we achieve higher classification accuracy using the well-known Node2Vec-based embedding compared to other baseline embedding approaches.
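To make the idea of an alignment-free, nearest-neighbors based sequence similarity network concrete, the following is a minimal sketch and not the authors' construction. It assumes k-mer count profiles as features, cosine similarity as the similarity measure, and a fixed number of neighbors per node; the abstract specifies none of these. The function names (`kmer_counts`, `cosine`, `build_ssn`) and the toy sequences are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k=3):
    """Count overlapping k-mers in a sequence (alignment-free features)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(a, b):
    """Cosine similarity between two sparse k-mer count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_ssn(seqs, k=3, n_neighbors=2):
    """Return an undirected nearest-neighbor graph as {node: {neighbor: similarity}}.

    Assumption: each sequence is linked to its n_neighbors most similar
    sequences; ties are broken by the lower sequence index.
    """
    profiles = [kmer_counts(s, k) for s in seqs]
    graph = {i: {} for i in range(len(seqs))}
    for i, p in enumerate(profiles):
        sims = [(cosine(p, q), j) for j, q in enumerate(profiles) if j != i]
        sims.sort(key=lambda t: (-t[0], t[1]))
        for sim, j in sims[:n_neighbors]:
            # Store the edge in both directions to keep the graph undirected.
            graph[i][j] = sim
            graph[j][i] = sim
    return graph


# Toy amino-acid fragments, invented for illustration only.
toy = ["MFVFLVLLPLVSSQ", "MFVFLVLLPLVSSQCV", "MFVFFVLLPLVSSQ", "NLTTRTQLPPAYTN"]
ssn = build_ssn(toy, k=3, n_neighbors=1)
```

The resulting adjacency structure could then be passed to a graph library for clustering or for node embedding (for example, Node2Vec) followed by classification. Because every node is forced to pick a neighbor, a sequence with no shared k-mers still receives a zero-weight edge, so a practical pipeline would likely add a similarity threshold.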