Paper Title
MedJEx: A Medical Jargon Extraction Model with Wiki's Hyperlink Span and Contextualized Masked Language Model Score
Paper Authors
Paper Abstract
This paper proposes a new natural language processing (NLP) application for identifying medical jargon terms in electronic health record (EHR) notes that are potentially difficult for patients to comprehend. We first present a novel and publicly available dataset, MedJ, with expert-annotated medical jargon terms from 18K+ EHR note sentences. Then, we introduce a novel medical jargon extraction model, MedJEx, which has been shown to outperform existing state-of-the-art NLP models. First, MedJEx's overall performance improved when it was trained on an auxiliary Wikipedia hyperlink span dataset, in which hyperlink spans provide additional Wikipedia articles to explain the spans (or terms), and then fine-tuned on the annotated MedJ data. Second, we found that a contextualized masked language model score was beneficial for detecting domain-specific unfamiliar jargon terms. Moreover, our results show that training on the auxiliary Wikipedia hyperlink span dataset improved performance on six out of eight biomedical named entity recognition benchmark datasets. Both MedJ and MedJEx are publicly available.