Paper Title
No Rumours Please! A Multi-Indic-Lingual Approach for COVID Fake-Tweet Detection
Paper Authors
Paper Abstract
The sudden widespread menace created by the present global pandemic COVID-19 has had an unprecedented effect on our lives. Mankind is experiencing immense fear and depends on social media like never before. Fear inevitably leads to panic, speculation, and the spread of misinformation. Many governments have taken measures to curb the spread of such misinformation for public well-being. Besides global measures, systems for demographically local languages have an important role to play in achieving effective outreach. Towards this, we propose an approach to detect fake news about COVID-19 early on from social media, such as tweets, for multiple Indic languages besides English. In addition, we create an annotated dataset of Hindi and Bengali tweets for fake news detection. We propose a BERT-based model augmented with additional relevant features extracted from Twitter to identify fake tweets. To extend our approach to multiple Indic languages, we resort to an mBERT-based model fine-tuned over the created datasets in Hindi and Bengali. We also propose a zero-shot learning approach to alleviate the data scarcity issue for such low-resource languages. Through rigorous experiments, we show that our approach reaches around 89% F-Score in fake tweet detection, which supersedes the state-of-the-art (SOTA) results. Moreover, we establish the first benchmark for two Indic languages, Hindi and Bengali. Using our annotated data, our model achieves about 79% F-Score on Hindi and 81% F-Score on Bengali tweets. Our zero-shot model achieves about 81% F-Score on Hindi and 78% F-Score on Bengali tweets without any annotated data, which clearly indicates the efficacy of our approach.
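As an illustration of the architecture the abstract describes, the sketch below shows one plausible way to wire an mBERT encoder together with extra Twitter-derived features: the [CLS] representation is concatenated with a small vector of handcrafted per-tweet features before a binary fake/real classification head. This is a minimal sketch under stated assumptions, not the authors' released implementation; the specific feature set (log follower count, verified flag, URL count), the head dimensions, and the class names are hypothetical.

```python
# Minimal sketch: mBERT text encoder + handcrafted Twitter features
# feeding a binary fake/real classifier head. Feature choices are
# illustrative assumptions, not the paper's exact feature set.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # mBERT, shared across Indic languages

class FakeTweetClassifier(nn.Module):
    def __init__(self, n_extra_features: int = 3, n_classes: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(MODEL_NAME)
        hidden = self.encoder.config.hidden_size  # 768 for mBERT
        # Classification head over [CLS] embedding concatenated with extra features.
        self.head = nn.Sequential(
            nn.Linear(hidden + n_extra_features, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, n_classes),
        )

    def forward(self, input_ids, attention_mask, extra_features):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]              # [CLS] token representation
        fused = torch.cat([cls, extra_features], dim=-1)
        return self.head(fused)                        # logits: fake vs. real

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = FakeTweetClassifier()

batch = tokenizer(["Example tweet text"], padding=True,
                  truncation=True, return_tensors="pt")
# Hypothetical per-tweet features: [log follower count, is_verified, num_urls]
extra = torch.tensor([[7.2, 1.0, 0.0]])
logits = model(batch["input_ids"], batch["attention_mask"], extra)
```

Because mBERT uses a single vocabulary and encoder shared across languages, a classifier of this shape fine-tuned only on annotated English tweets can be applied unchanged to Hindi or Bengali input, which is the essence of the zero-shot setting reported in the abstract.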