Paper Title

Document Classification for COVID-19 Literature

Authors

Bernal Jiménez Gutiérrez, Juncheng Zeng, Dongdong Zhang, Ping Zhang, Yu Su

Abstract

The global pandemic has made it more important than ever to quickly and accurately retrieve relevant scientific literature for effective consumption by researchers in a wide range of fields. We provide an analysis of several multi-label document classification models on the LitCovid dataset, a growing collection of 23,000 research papers regarding the novel 2019 coronavirus. We find that pre-trained language models fine-tuned on this dataset outperform all other baselines and that BioBERT surpasses the others by a small margin with micro-F1 and accuracy scores of around 86% and 75% respectively on the test set. We evaluate the data efficiency and generalizability of these models as essential features of any system prepared to deal with an urgent situation like the current health crisis. Finally, we explore 50 errors made by the best performing models on LitCovid documents and find that they often (1) correlate certain labels too closely together and (2) fail to focus on discriminative sections of the articles; both of which are important issues to address in future work. Both data and code are available on GitHub.
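As a minimal illustration of the two metrics reported in the abstract (not the paper's actual evaluation code), micro-F1 and exact-match accuracy for multi-label predictions can be computed with scikit-learn. The label matrix below is hypothetical; in LitCovid each column would correspond to a topic label such as Treatment or Diagnosis.

```python
import numpy as np
from sklearn.metrics import f1_score, accuracy_score

# Hypothetical binary label matrices: rows are documents, columns are
# topic labels. A 1 means the document is tagged with that label.
y_true = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])
y_pred = np.array([
    [1, 0, 1, 0],   # exact match
    [0, 1, 1, 0],   # one spurious label
    [1, 1, 0, 1],   # exact match
    [0, 0, 0, 0],   # one missed label
])

# Micro-F1 pools true/false positives and false negatives across
# all labels before computing a single F1 score.
micro_f1 = f1_score(y_true, y_pred, average="micro")

# For multi-label input, sklearn's accuracy_score is subset accuracy:
# a document counts as correct only if every label matches exactly.
subset_acc = accuracy_score(y_true, y_pred)

print(f"micro-F1: {micro_f1:.3f}, subset accuracy: {subset_acc:.3f}")
```

The distinction matters when reading the abstract's numbers: micro-F1 gives partial credit for partially correct label sets, while the accuracy figure requires the full label set to match, which is why it is the lower of the two.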
