Paper title
Multimodal Pipeline for Collection of Misinformation Data from Telegram
Paper authors
Abstract
This paper presents the outcomes of AI-COVID19, a project that aims to better understand the flow of COVID-19 misinformation across social media platforms. The study reported here focuses on collecting data from Telegram groups that actively promote COVID-related misinformation. The corpus collected so far contains around 28 million words from almost one million messages. Given that a substantial portion of the misinformation on social media is spread via multimodal means, such as images and video, we have also developed a mechanism for making use of this content: videos are transcribed automatically, and images are automatically classified into categories such as memes, screenshots of posts and other kinds of images. The accuracy of the image classification pipeline is around 87%.