Paper Title

Quo Vadis: Hybrid Machine Learning Meta-Model based on Contextual and Behavioral Malware Representations

Paper Author

Trizna, Dmitrijs

Abstract

We propose a hybrid machine learning architecture that simultaneously employs multiple deep learning models analyzing contextual and behavioral characteristics of Windows portable executable (PE) files, producing a final prediction based on a decision from the meta-model. The detection heuristic in contemporary machine learning Windows malware classifiers is typically based on the static properties of the sample, since dynamic analysis through virtualization is challenging for vast quantities of samples. To surpass this limitation, we employ Windows kernel emulation, which allows the acquisition of behavioral patterns across large corpora with minimal temporal and computational costs. We partner with a security vendor for a collection of more than 100k in-the-wild samples that resemble the contemporary threat landscape, containing raw PE files and filepaths of applications at the moment of execution. The acquired dataset is at least ten times larger than those reported in related works on behavioral malware analysis. Files in the training dataset are labeled by a professional threat intelligence team, utilizing manual and automated reverse engineering tools. We estimate the hybrid classifier's operational utility by collecting an out-of-sample test set three months after the acquisition of the training set. We report an improved detection rate, above the capabilities of the current state-of-the-art model, especially under low false-positive requirements. Additionally, we uncover the meta-model's ability to identify malicious activity in validation and test sets even if none of the individual models expresses enough confidence to mark the sample as malevolent. We conclude that the meta-model can learn patterns typical of malicious samples from representation combinations produced by different analysis techniques. We publicly release pre-trained models and an anonymized dataset of emulation reports.
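The two ideas the abstract leans on, a meta-model that combines the outputs of several base models and a detection rate measured under a low false-positive budget, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say which meta-model is used, so a plain logistic regression stands in for it, the three base models are replaced by synthetic noisy scores, and the names `train_meta_model`, `meta_predict`, and `detection_rate_at_fpr` are hypothetical. The numbers it produces say nothing about the paper's results.

```python
import numpy as np

def _with_bias(scores):
    # Append a constant column so the meta-model can learn an intercept.
    return np.hstack([scores, np.ones((scores.shape[0], 1))])

def train_meta_model(scores, labels, lr=0.5, epochs=2000):
    """Fit a logistic-regression stand-in meta-model by gradient descent.

    scores: (n_samples, n_base_models) array of base-model outputs.
    labels: (n_samples,) array of 0 (benign) / 1 (malicious).
    """
    X = _with_bias(scores)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - labels) / len(labels)
    return w

def meta_predict(w, scores):
    """Return the meta-model's maliciousness probability per sample."""
    return 1.0 / (1.0 + np.exp(-_with_bias(scores) @ w))

def detection_rate_at_fpr(y_true, y_score, max_fpr):
    """Detection rate (TPR) at a threshold whose FPR does not exceed max_fpr."""
    benign = np.sort(y_score[y_true == 0])
    # Number of benign samples we may misclassify within the budget.
    k = min(int(np.floor(max_fpr * len(benign))), len(benign) - 1)
    # Flagging scores strictly above the (k+1)-th largest benign score
    # produces at most k false positives.
    threshold = benign[len(benign) - k - 1]
    tpr = float(np.mean(y_score[y_true == 1] > threshold))
    return tpr, float(threshold)

# Synthetic stand-in for three base models (not the paper's data).
rng = np.random.default_rng(0)
n = 2000
labels = rng.integers(0, 2, size=n)
scores = np.clip(labels[:, None] * 0.3 + rng.normal(0.35, 0.2, size=(n, 3)), 0, 1)

# Train on the first half, evaluate on the held-out second half.
w = train_meta_model(scores[:1000], labels[:1000])
test_labels = labels[1000:]
meta_scores = meta_predict(w, scores[1000:])
tpr, threshold = detection_rate_at_fpr(test_labels, meta_scores, max_fpr=0.01)
print(f"detection rate at FPR <= 1%: {tpr:.3f} (threshold {threshold:.3f})")
```

Fixing the false-positive budget first and then reading off the detection rate mirrors how the abstract frames its comparison: a classifier is only useful operationally if it stays under the allowed false-alarm rate, so the threshold is derived from benign scores rather than set to 0.5.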
