Paper title
A pipeline and comparative study of 12 machine learning models for text classification
Authors
Abstract
Text-based communication is a highly favoured communication method, especially in business environments. As a result, it is often abused to send malicious messages, e.g., spam emails, that deceive users into relaying personal information, including online account credentials or banking details. For this reason, many machine learning methods for text classification have been proposed and incorporated into the services of most email providers. However, optimising text classification algorithms and finding the right trade-off in their aggressiveness is still a major research problem. We present an updated survey of 12 machine learning text classifiers applied to a public spam corpus. A new pipeline is proposed to optimise hyperparameter selection and improve the models' performance by applying specific methods (based on natural language processing) in the preprocessing stage. Our study aims to provide a new methodology to investigate and optimise the effect of different feature sizes and hyperparameters in machine learning classifiers that are widely used in text classification problems. The classifiers are tested and evaluated on different metrics, including F-score (accuracy), precision, recall, and run time. By analysing all these aspects, we show how the proposed pipeline can be used to achieve good accuracy in spam filtering on the Enron dataset, a widely used public email corpus. Statistical tests and explainability techniques are applied to provide a robust analysis of the proposed pipeline, to interpret the classification outcomes of the 12 machine learning models, and to identify the words that drive the classification results. Our analysis shows that it is possible to identify an effective machine learning model that classifies the Enron dataset with an F-score of 94%.
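The abstract reports precision, recall, and F-score without stating how they are computed. The sketch below shows one common way to derive them for a binary spam/ham task. It assumes the balanced F-score (F1, the harmonic mean of precision and recall), which the abstract does not confirm. The function name, the "spam"/"ham" labels, and the toy predictions are hypothetical and are not taken from the paper or the Enron dataset.

```python
def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Return (precision, recall, F1) for one positive class.

    Assumption: F-score means F1 (beta = 1); the abstract does not say.
    """
    pairs = list(zip(y_true, y_pred))
    # Count the three confusion-matrix cells that the metrics depend on.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels, invented for illustration only.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case there are 3 true positives, 1 false positive, and 1 false negative, so all three metrics equal 0.75. Precision reflects how aggressive the filter is (legitimate mail flagged as spam lowers it), while recall reflects how much spam gets through, which is the trade-off the abstract refers to.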