Evaluating Generalizability of Fine-Tuned Models for Fake News Detection

Paper Title

Evaluating Generalizability of Fine-Tuned Models for Fake News Detection

Authors

Abhijit Suprem, Calton Pu

Abstract

The COVID-19 pandemic has caused a dramatic, parallel rise in dangerous misinformation, termed an 'infodemic' by the CDC and WHO. Misinformation tied to the COVID-19 infodemic changes continuously; this can degrade the performance of fine-tuned models through concept drift. Degradation can be mitigated if models generalize well enough to capture some cyclical aspects of drifted data. In this paper, we explore the generalizability of pre-trained and fine-tuned fake news detectors across 9 fake news datasets. We show that existing models often overfit on their training dataset and perform poorly on unseen data. However, on some subsets of unseen data that overlap with training data, models have higher accuracy. Based on this observation, we also present KMeans-Proxy, a fast and effective method based on K-Means clustering for identifying these overlapping subsets of unseen data. KMeans-Proxy improves generalizability on unseen fake news datasets by 0.1-0.2 F1 points across datasets. We present both our generalizability experiments and KMeans-Proxy to further research on tackling the fake news problem.
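The abstract does not spell out how KMeans-Proxy works, so the following is only an illustrative sketch of the general idea it names: fit K-Means on training data, then flag unseen points that fall about as close to a training cluster center as the training points themselves do. Everything here is an assumption made for illustration and not the authors' implementation: the function names (`fit_kmeans`, `overlap_mask`), the quantile-based radius criterion, and the use of 2-D tuples as stand-ins for text embeddings.

```python
import math
import random


def nearest(point, centers):
    """Return (index, distance) of the center closest to `point`."""
    distances = [math.dist(point, c) for c in centers]
    index = min(range(len(centers)), key=distances.__getitem__)
    return index, distances[index]


def fit_kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns k cluster centers."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)[0]].append(p)
        # Move each center to the mean of its members (keep it if empty).
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if updated == centers:
            break
        centers = updated
    return centers


def overlap_mask(train, unseen, k=2, quantile=0.95):
    """Hypothetical overlap test: True where an unseen point lies within
    the radius that covers `quantile` of the training points."""
    centers = fit_kmeans(train, k)
    train_distances = sorted(nearest(p, centers)[1] for p in train)
    radius = train_distances[int(quantile * (len(train_distances) - 1))]
    return [nearest(p, centers)[1] <= radius for p in unseen]


# Toy data: two tight training clusters and three unseen points,
# the last of which is far from both clusters.
rng = random.Random(1)
train = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(50)]
train += [(rng.gauss(5, 0.1), rng.gauss(5, 0.1)) for _ in range(50)]
unseen = [(0.0, 0.0), (5.0, 5.0), (20.0, 20.0)]
mask = overlap_mask(train, unseen)
print(mask)
```

In a setting like the one the abstract describes, a fine-tuned detector's predictions might be trusted only on the unseen points where the mask is true, which is where the abstract reports higher accuracy. How the paper actually selects or weights those points is not stated here.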
