Paper Title
Initial Study into Application of Feature Density and Linguistically-backed Embedding to Improve Machine Learning-based Cyberbullying Detection
Paper Authors
Paper Abstract
In this research, we study how the performance of machine learning (ML) classifiers changes when various linguistic preprocessing methods are applied to a dataset, with a specific focus on linguistically-backed embeddings in Convolutional Neural Networks (CNNs). Moreover, we study the concept of Feature Density and confirm its potential to comparatively predict the performance of ML classifiers, including CNNs. The research was conducted on a Formspring dataset provided in a Kaggle competition on automatic cyberbullying detection. The dataset was re-annotated by objective experts (psychologists), as the importance of professional annotation in cyberbullying research has been indicated multiple times. The study confirmed the effectiveness of neural networks in cyberbullying detection and the correlation between classifier performance and Feature Density, while also proposing a new approach to training various linguistically-backed embeddings for Convolutional Neural Networks.