Paper Title
A Deep Dive into VirusTotal: Characterizing and Clustering a Massive File Feed
Paper Authors
Paper Abstract
Online scanners analyze user-submitted files with a large number of security tools and provide access to the analysis results. As the most popular online scanner, VirusTotal (VT) is often used for determining if samples are malicious, labeling samples with their family, hunting for new threats, and collecting malware samples. We analyze 328M VT reports for 235M samples collected over one year through the VT file feed. We use the reports to characterize the VT file feed in depth and compare it with the telemetry of a large security vendor. We answer questions such as: How diverse is the feed? Does it allow building malware datasets for different filetypes? How fresh are the samples it provides? What is the distribution of malware families it sees? Does that distribution really represent malware on user devices? We then explore how to perform threat hunting at scale by investigating scalable approaches that can produce high-purity clusters on the 235M feed samples. We investigate three clustering approaches: hierarchical agglomerative clustering (HAC), a more scalable HAC variant for TLSH digests (HAC-T), and a simple feature value grouping (FVG). Our results show that HAC-T and FVG using selected features produce high-precision clusters on ground truth datasets. However, only FVG scales to the daily influx of samples in the feed. Moreover, FVG takes 15 hours to cluster the whole dataset of 235M samples. Finally, we use the produced clusters for threat hunting, namely for detecting 190K samples thought to be benign (i.e., with zero detections) that may really be malicious because they belong to 29K clusters where most samples are detected as malicious.
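The feature value grouping (FVG) mentioned in the abstract can be sketched as grouping samples that share identical values for a set of selected features. The sketch below is a minimal illustration under that assumption; the feature names (`icon_hash`, `pe_imphash`) and the sample records are hypothetical, not the paper's actual feature selection.

```python
from collections import defaultdict

def feature_value_grouping(samples, features):
    """Group samples whose selected features all have identical values.

    samples:  list of dicts mapping feature name -> value
    features: names of the selected features to group on
              (illustrative; the paper's feature set is not shown here)
    """
    clusters = defaultdict(list)
    for s in samples:
        # The grouping key is the tuple of this sample's selected feature values.
        key = tuple(s.get(f) for f in features)
        clusters[key].append(s)
    return list(clusters.values())

# Hypothetical usage: samples 1 and 2 share both feature values,
# so they land in the same cluster; sample 3 forms its own cluster.
samples = [
    {"id": 1, "icon_hash": "a", "pe_imphash": "x"},
    {"id": 2, "icon_hash": "a", "pe_imphash": "x"},
    {"id": 3, "icon_hash": "b", "pe_imphash": "y"},
]
groups = feature_value_grouping(samples, ["icon_hash", "pe_imphash"])
```

Because each sample contributes one dictionary lookup and append, this grouping runs in linear time over the input, which is consistent with FVG being the only approach in the abstract that scales to the feed's daily influx.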