Paper title
Comprehensive Named Entity Recognition on CORD-19 with Distant or Weak Supervision
Paper authors
Paper abstract
We created the CORD-NER dataset by applying comprehensive named entity recognition (NER) to the COVID-19 Open Research Dataset Challenge (CORD-19) corpus (2020-03-13). The dataset covers 75 fine-grained entity types: in addition to common biomedical entity types (e.g., genes, chemicals, and diseases), it covers many new entity types related explicitly to COVID-19 studies (e.g., coronaviruses, viral proteins, evolution, materials, substrates, and immune responses), which may benefit research on COVID-19-related viruses, spreading mechanisms, and potential vaccines. The CORD-NER annotation combines four sources that use different NER methods. Its quality surpasses that of SciSpacy, a fully supervised BioNER tool, with an F1 score over 10% higher on a sample set of documents. Moreover, CORD-NER supports incrementally adding new documents, and new entity types can be added when needed by supplying dozens of seeds as input examples. We will continually update CORD-NER based on the incremental updates of the CORD-19 corpus and improvements to our system.
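The F1 comparison mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe the evaluation protocol, so the code below assumes exact-match, entity-level scoring over (start, end, type) spans. The function name `span_f1`, the entity type labels, and the sample spans are all hypothetical and are not taken from the CORD-NER release.

```python
def span_f1(gold, predicted):
    """Exact-match entity-level precision, recall, and F1.

    Both arguments are iterables of (start, end, entity_type) tuples.
    A prediction counts as correct only if all three fields match.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations for one sentence (offsets and labels are made up).
gold = [(0, 10, "CORONAVIRUS"), (15, 28, "VIRAL_PROTEIN"), (40, 47, "DISEASE")]
predicted = [(0, 10, "CORONAVIRUS"), (15, 28, "GENE"), (40, 47, "DISEASE")]

precision, recall, f1 = span_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case, two of the three predicted spans match the gold spans exactly, so precision, recall, and F1 all come out to 2/3. Under a scheme like this, a span with the right boundaries but the wrong type counts as an error, which matters when the label set is as fine-grained as the 75 types described above.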