Paper Title
Extracting COVID-19 Diagnoses and Symptoms From Clinical Text: A New Annotated Corpus and Neural Event Extraction Framework
Paper Authors
Abstract
Coronavirus disease 2019 (COVID-19) is a global pandemic. Although much has been learned about the novel coronavirus since its emergence, there are many open questions related to tracking its spread, describing symptomology, predicting the severity of infection, and forecasting healthcare utilization. Free-text clinical notes contain critical information for resolving these questions. Data-driven, automatic information extraction models are needed to use this text-encoded information in large-scale studies. This work presents a new clinical corpus, referred to as the COVID-19 Annotated Clinical Text (CACT) Corpus, which comprises 1,472 notes with detailed annotations characterizing COVID-19 diagnoses, testing, and clinical presentation. We introduce a span-based event extraction model that jointly extracts all annotated phenomena, achieving high performance in identifying COVID-19 and symptom events with associated assertion values (0.83-0.97 F1 for events and 0.73-0.79 F1 for assertions). In a secondary use application, we explored the prediction of COVID-19 test results using structured patient data (e.g., vital signs and laboratory results) and automatically extracted symptom information. The automatically extracted symptoms improve prediction performance beyond that achieved with structured data alone.