Mind the Gap: On Bridging the Semantic Gap between Machine Learning and Information Security

Paper Title

Mind the Gap: On Bridging the Semantic Gap between Machine Learning and Information Security

Authors

Smith, Michael R., Johnson, Nicholas T., Ingram, Joe B., Carbajal, Armida J., Ramyaa, Ramyaa, Domschot, Evelyn, Lamb, Christopher C., Verzi, Stephen J., Kegelmeyer, W. Philip

Abstract

Despite the potential of machine learning (ML) to learn the behavior of malware, detect novel malware samples, and significantly improve information security (InfoSec), we see few, if any, high-impact ML techniques in deployed systems, notwithstanding multiple reported successes in the open literature. We hypothesize that the failure of ML to make a high impact in InfoSec is rooted in a disconnect between the two communities, as evidenced by a semantic gap: a difference in how executables are described (e.g., the data and the features extracted from the data). Specifically, current datasets and representations used by ML are not suitable for learning the behaviors of an executable and differ significantly from those used by the InfoSec community. In this paper, we survey existing datasets used for classifying malware by ML algorithms and the features that are extracted from the data. We observe that: 1) the current set of extracted features is primarily syntactic, not behavioral, 2) datasets generally contain extreme exemplars, producing a dataset in which it is easy to discriminate classes, and 3) the datasets provide significantly different representations of the data encountered in real-world systems. For ML to make more of an impact in the InfoSec community, the data (including the features and labels) that is used must change to bridge the current semantic gap. As a first step in enabling more behavioral analyses, we label existing malware datasets with behavioral features using open-source threat reports associated with malware families. This behavioral labeling alters the analysis from identifying intent (e.g., good vs. bad) or malware family membership to an analysis of which behaviors are exhibited by an executable. We offer the annotations with the hope of inspiring future improvements in the data that will further bridge the semantic gap between the ML and InfoSec communities.
