Paper title
Offensive Language Detection in Under-resourced Algerian Dialectal Arabic Language
Paper authors
Paper abstract
This paper addresses the problem of detecting offensive and abusive content in Facebook comments, focusing on Algerian dialectal Arabic, an under-resourced language. Algerian Arabic comprises a variety of dialects mixed with other languages (Berber, French and English). In addition, we handle texts written in both Arabic and Roman scripts (the latter known as Arabizi). Because little prior work exists on this language, we built a new corpus of more than 8.7k texts manually annotated as normal, abusive or offensive. We conducted a series of experiments with state-of-the-art text classifiers, namely BiLSTM, CNN, FastText, SVM and NB. The results show acceptable performance, but further investigation of linguistic features is needed to improve identification accuracy.
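To make the three-way task concrete, the following is a minimal sketch of one of the baseline families named in the abstract, a multinomial Naive Bayes (NB) classifier, written with the Python standard library only. The character n-gram features, the smoothing value, and the names `char_ngrams`, `NaiveBayes`, `toy_texts` and `toy_labels` are assumptions made for illustration; the abstract does not state which features or settings the authors used. The training strings are neutral placeholders, not samples from the corpus.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=2, n_max=4):
    """Return character n-grams of a lowercased, whitespace-normalised text.

    Character n-grams are an assumption here: they need no tokeniser and
    apply equally to Arabic-script and Roman-script (Arabizi) input.
    """
    s = " ".join(text.lower().split())
    return [s[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(s) - n + 1)]


class NaiveBayes:
    """Multinomial Naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)          # documents per class
        self.feat_counts = defaultdict(Counter)    # n-gram counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            grams = char_ngrams(text)
            self.feat_counts[label].update(grams)
            self.vocab.update(grams)
        self.totals = {c: sum(self.feat_counts[c].values())
                       for c in self.doc_counts}
        return self

    def predict(self, text):
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for c in self.doc_counts:
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[c] / n_docs)
            denom = self.totals[c] + self.alpha * vocab_size
            for g in char_ngrams(text):
                if g in self.vocab:  # n-grams unseen in training are skipped
                    score += math.log(
                        (self.feat_counts[c][g] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = c, score
        return best_label


# Hypothetical placeholder data: one neutral dummy string per class.
toy_texts = ["aaa bbb", "ccc ddd", "eee fff"]
toy_labels = ["normal", "abusive", "offensive"]

model = NaiveBayes().fit(toy_texts, toy_labels)
print(model.predict("aaa bbb"))
```

On real data, the placeholder lists would be replaced by the annotated comments and their labels, and the model would be evaluated on a held-out split rather than on its own training strings.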