Paper Title
Biomedical Named Entity Recognition at Scale
Paper Authors
Abstract
Named entity recognition (NER) is a widely applicable natural language processing task and a building block of question answering, topic modeling, information retrieval, and more. In the medical domain, NER plays a crucial role by extracting meaningful chunks from clinical notes and reports, which are then fed to downstream tasks like assertion status detection, entity resolution, relation extraction, and de-identification. Reimplementing a Bi-LSTM-CNN-Char deep learning architecture on top of Apache Spark, we present a single trainable NER model that obtains new state-of-the-art results on seven public biomedical benchmarks without using heavy contextual embeddings like BERT. This includes improving BC4CHEMD to 93.72% (4.1% gain), Species800 to 80.91% (4.6% gain), and JNLPBA to 81.29% (5.2% gain). In addition, this model is freely available within a production-grade code base as part of the open-source Spark NLP library; can scale up for training and inference in any Spark cluster; has GPU support and libraries for popular programming languages such as Python, R, Scala, and Java; and can be extended to support other human languages with no code changes.
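To make concrete what "extracting meaningful chunks" means, the sketch below shows how token-level BIO tags (the output format typically produced by NER models like the one described) are assembled into entity chunks. This is an illustrative helper, not code from the paper or the Spark NLP library; the entity labels and sentence are made up for the example.

```python
def bio_to_chunks(tokens, tags):
    """Collect (entity_type, text) chunks from BIO-tagged tokens.

    A chunk starts at a B- tag and extends over consecutive I- tags
    of the same entity type; O tags close any open chunk.
    """
    chunks, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((ctype, " ".join(current)))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(tok)
        else:  # "O" tag or an I- tag that does not continue the open chunk
            if current:
                chunks.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        chunks.append((ctype, " ".join(current)))
    return chunks

# Hypothetical clinical-note fragment with made-up entity labels
tokens = ["Patient", "was", "given", "aspirin", "for", "chest", "pain"]
tags   = ["O", "O", "O", "B-DRUG", "O", "B-PROBLEM", "I-PROBLEM"]
print(bio_to_chunks(tokens, tags))
# → [('DRUG', 'aspirin'), ('PROBLEM', 'chest pain')]
```

Chunks like these are exactly what downstream components (assertion status detection, entity resolution, de-identification) consume.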