Construction of Large-Scale Misinformation Labeled Datasets from Social Media Discourse using Label Refinement

Paper Title

Construction of Large-Scale Misinformation Labeled Datasets from Social Media Discourse using Label Refinement

Authors

Karishma Sharma, Emilio Ferrara, Yan Liu

Abstract

Malicious accounts spreading misinformation have led to widespread false and misleading narratives in recent times, especially during the COVID-19 pandemic, and social media platforms struggle to eliminate such content rapidly. This is because adapting to new domains requires human-intensive fact-checking that is slow and difficult to scale. To address this challenge, we propose to leverage news-source credibility labels as weak labels for social media posts and propose model-guided refinement of labels to construct large-scale, diverse misinformation labeled datasets in new domains. The weak labels can be inaccurate at the article or social media post level when the stance of the user does not align with the credibility of the news source or article. We propose a framework that uses a detection model self-trained on the initial weak labels, with uncertainty sampling based on the entropy of the model's predictions, to identify potentially inaccurate labels and correct them using self-supervision or relabeling. The framework will incorporate the social context of the post, in terms of the community of its associated user, to surface inaccurate labels, with the aim of building a large-scale dataset with minimum human effort. To provide labeled datasets that distinguish misleading narratives, where information might be missing significant context or have inaccurate ancillary details, the proposed framework will use the few labeled samples as class prototypes to separate high-confidence samples into false, unproven, mixture, mostly false, mostly true, true, and debunk information. The approach is demonstrated by providing a large-scale misinformation dataset on COVID-19 vaccines.
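
The entropy-based uncertainty sampling mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two-class probability layout, and the sample values are all assumptions made for illustration. It only shows the general idea of ranking weakly labeled posts by the entropy of a model's predicted class distribution and flagging the most uncertain ones for relabeling.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    # Zero-probability classes contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def flag_uncertain(pred_probs, k):
    """Return indices of the k posts with the highest prediction entropy."""
    ranked = sorted(
        range(len(pred_probs)),
        key=lambda i: prediction_entropy(pred_probs[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical model outputs: [P(reliable), P(unreliable)] for four posts.
pred_probs = [
    [0.98, 0.02],
    [0.55, 0.45],
    [0.10, 0.90],
    [0.50, 0.50],
]

# Posts whose weak labels would be candidates for review or relabeling.
uncertain = flag_uncertain(pred_probs, k=2)
print(uncertain)
```

In this toy input, the two posts with near-uniform predictions are the ones surfaced for review, while confidently predicted posts keep their weak labels.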
