Paper Title
Effect of Balancing Data Using Synthetic Data on the Performance of Machine Learning Classifiers for Intrusion Detection in Computer Networks
Authors
Abstract
Attacks on computer networks have increased significantly in recent years, due in part to the availability of sophisticated tools for launching such attacks and to a thriving underground cyber-crime economy that supports them. Over the past several years, researchers in academia and industry have used machine learning (ML) techniques to design and implement Intrusion Detection Systems (IDSes) for computer networks. Many of these researchers have used datasets collected by various organizations to train ML models for predicting intrusions. In many of the datasets used in such systems, the data are imbalanced (i.e., not all classes have an equal number of samples). With imbalanced data, the predictive models developed using ML algorithms may yield unsatisfactory classifiers, which reduces accuracy in predicting intrusions. Traditionally, researchers have used over-sampling and under-sampling to balance the data in such datasets and overcome this problem. In this work, in addition to over-sampling, we also use a synthetic data generation method, called Conditional Generative Adversarial Network (CTGAN), to balance the data, and we study the effect of both approaches on various ML classifiers. To the best of our knowledge, no one else has used CTGAN to generate synthetic samples to balance intrusion detection datasets. Based on extensive experiments using the widely used NSL-KDD dataset, we found that training ML models on a dataset balanced with synthetic samples generated by CTGAN increased prediction accuracy by up to $8\%$, compared to training the same ML models on imbalanced data. Our experiments also show that the accuracy of some ML models trained on data balanced with random over-sampling declines compared to the same ML models trained on imbalanced data.
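To make the balancing step concrete, the following is a minimal sketch of random over-sampling, the baseline the abstract compares against. It is not the authors' code: the function name `random_oversample`, the toy feature rows, and the class labels are all hypothetical, chosen only for illustration. The CTGAN-based approach would differ in one place: instead of duplicating existing minority-class rows, it would add new rows drawn from a trained generator.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the majority class

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows that belong to this class
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw with replacement to fill the gap up to the target size
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy, made-up data: four "normal" rows and one row for each attack class
rows = [[0.1, 1], [0.2, 0], [0.3, 1], [0.4, 0], [0.9, 1], [0.8, 0]]
labels = ["normal", "normal", "normal", "normal", "dos", "probe"]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Because the added rows are exact copies of existing ones, this method adds no new information about the minority classes, which is one plausible reason a classifier trained on such data could fail to improve.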