Creating a Multimodal Dataset of Images and Text to Study Abusive Language

Paper Title

Creating a Multimodal Dataset of Images and Text to Study Abusive Language

Authors

Alessio Palmero Aprosio, Stefano Menini, Sara Tonelli

Abstract

To study online hate speech, the availability of datasets containing the linguistic phenomena of interest is of crucial importance. However, when it comes to specific target groups, for example teenagers, collecting such data may be problematic due to consent and privacy restrictions. Furthermore, while text-only datasets of this kind have been widely used, limitations set by image-based social media platforms like Instagram make it difficult for researchers to experiment with multimodal hate speech data. We therefore developed CREENDER, an annotation tool that has been used in school classes to create a multimodal dataset of images and abusive comments, which we make freely available under the Apache 2.0 license. The corpus, with Italian comments, has been analysed from different perspectives to investigate whether the subject of the images plays a role in triggering a comment. We find that users judge the same images in different ways, although the presence of a person in the picture increases the probability of getting an offensive comment.
