Paper Title
Biomedical Named Entity Recognition at Scale
Paper Authors
Abstract
Named entity recognition (NER) is a widely applicable natural language processing task and a building block of question answering, topic modeling, information retrieval, and more. In the medical domain, NER plays a crucial role by extracting meaningful chunks from clinical notes and reports, which are then fed to downstream tasks like assertion status detection, entity resolution, relation extraction, and de-identification. Reimplementing a Bi-LSTM-CNN-Char deep learning architecture on top of Apache Spark, we present a single trainable NER model that obtains new state-of-the-art results on seven public biomedical benchmarks without using heavy contextual embeddings like BERT. This includes improving BC4CHEMD to 93.72% (4.1% gain), Species800 to 80.91% (4.6% gain), and JNLPBA to 81.29% (5.2% gain). In addition, this model is freely available within a production-grade code base as part of the open-source Spark NLP library; can scale up for training and inference in any Spark cluster; has GPU support and libraries for popular programming languages such as Python, R, Scala, and Java; and can be extended to support other human languages with no code changes.
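To make concrete what "extracting meaningful chunks" means, the sketch below shows how token-level BIO tags (the output format typically produced by NER models like the one described) are assembled into entity chunks. This is an illustrative helper, not code from the paper or the Spark NLP library; the entity labels and sentence are made up for the example.

```python
def bio_to_chunks(tokens, tags):
    """Collect (entity_type, text) chunks from BIO-tagged tokens.

    A chunk starts at a B- tag and extends over consecutive I- tags
    of the same entity type; O tags close any open chunk.
    """
    chunks, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((ctype, " ".join(current)))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(tok)
        else:  # "O" tag or an I- tag that does not continue the open chunk
            if current:
                chunks.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        chunks.append((ctype, " ".join(current)))
    return chunks

# Hypothetical clinical-note fragment with made-up entity labels
tokens = ["Patient", "was", "given", "aspirin", "for", "chest", "pain"]
tags   = ["O", "O", "O", "B-DRUG", "O", "B-PROBLEM", "I-PROBLEM"]
print(bio_to_chunks(tokens, tags))
# → [('DRUG', 'aspirin'), ('PROBLEM', 'chest pain')]
```

Chunks like these are exactly what downstream components (assertion status detection, entity resolution, de-identification) consume.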