SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language Identification

Paper Title

SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language Identification

Authors

Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Marcos Zampieri, Preslav Nakov

Abstract

The widespread use of offensive content in social media has led to an abundance of research in detecting language such as hate speech, cyberbullying, and cyber-aggression. Recent work presented the OLID dataset, which follows a taxonomy for offensive language identification that provides meaningful information for understanding the type and the target of offensive messages. However, it is limited in size and it might be biased towards offensive language as it was collected using keywords. In this work, we present SOLID, an expanded dataset, where the tweets were collected in a more principled manner. SOLID contains over nine million English tweets labeled in a semi-supervised fashion. We demonstrate that using SOLID along with OLID yields sizable performance gains on the OLID test set for two different models, especially for the lower levels of the taxonomy.
