Abusive and Threatening Language Detection in Urdu using Supervised Machine Learning and Feature Combinations

Paper title

Abusive and Threatening Language Detection in Urdu using Supervised Machine Learning and Feature Combinations

Authors

Humayoun, Muhammad

Abstract

This paper presents the systems submitted to the FIRE Shared Task 2021 on Abusive and Threatening Language Detection in Urdu. The challenge aims at automatically identifying abusive and threatening tweets written in Urdu. Our submitted results received the third recognition at the competition. This paper reports a non-exhaustive list of the experiments that led to the submitted results. Moreover, after the competition results were declared, we obtained better results than those we submitted. Our models achieved an F1 score of 0.8318 on Task A (Abusive Language Detection for Urdu Tweets) and an F1 score of 0.4931 on Task B (Threatening Language Detection for Urdu Tweets). For Task A, the best results came from Support Vector Machines with stopwords removed, lemmatization applied, and a feature vector built from the combination of word n-grams for n=1,2,3. For Task B, the best results came from Support Vector Machines with stopwords removed, lemmatization not applied, a feature vector built from a pre-trained Urdu Word2Vec model (on word unigrams and bigrams), and the dataset balanced by oversampling. The code is made available for reproducibility.
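
As a rough illustration of two steps named in the abstract, the sketch below combines word n-grams for n=1,2,3 into a single feature list and balances a dataset by random oversampling. It is not the authors' released code: the function names `word_ngrams` and `random_oversample`, the placeholder tokens, and the choice of plain random duplication as the oversampling technique are all assumptions made for this example. The abstract does not say which oversampling method was used.

```python
import random
from collections import Counter


def word_ngrams(tokens, ns=(1, 2, 3)):
    """Return the combined word n-grams of `tokens` for every n in `ns`.

    Hypothetical helper: the n-grams are joined with a space so that
    unigrams, bigrams and trigrams can share one feature vocabulary.
    """
    grams = []
    for n in ns:
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until
    every class has as many samples as the largest one.

    Assumption: plain random oversampling stands in for whatever
    oversampling technique the paper actually used.
    """
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_label.values())

    out_samples, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Placeholder tokens standing in for a tokenized, stopword-filtered tweet.
tokens = ["w1", "w2", "w3", "w4"]
feature_counts = Counter(word_ngrams(tokens))

# Placeholder dataset with an under-represented positive class.
tweets = ["t1", "t2", "t3", "t4"]
labels = [0, 0, 0, 1]
balanced_tweets, balanced_labels = random_oversample(tweets, labels)
```

In a full pipeline, counts such as `feature_counts` would be turned into vectors and passed to a Support Vector Machine. Oversampling is normally applied to the training split only, so that duplicated samples do not leak into evaluation data.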
