Paper Title

The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources

Authors

Jennifer D'Souza, Anett Hoppe, Arthur Brack, Mohamad Yaser Jaradeh, Sören Auer, Ralph Ewerth

Abstract

We introduce the STEM (Science, Technology, Engineering, and Medicine) Dataset for Scientific Entity Extraction, Classification, and Resolution, version 1.0 (STEM-ECR v1.0). The STEM-ECR v1.0 dataset has been developed to provide a benchmark for evaluating scientific entity extraction, classification, and resolution tasks in a domain-independent fashion. It comprises abstracts from 10 STEM disciplines that were found to be the most prolific ones on a major publishing platform. We describe the creation of this multidisciplinary corpus and highlight the findings obtained in terms of the following features: 1) a generic conceptual formalism for scientific entities in a multidisciplinary scientific context; 2) the feasibility of domain-independent human annotation of scientific entities under such a generic formalism; 3) a performance benchmark obtainable for automatic extraction of multidisciplinary scientific entities using BERT-based neural models; 4) a delineated 3-step entity resolution procedure for human annotation of the scientific entities via encyclopedic entity linking and lexicographic word sense disambiguation; and 5) human evaluations of the encyclopedic links and lexicographic senses that Babelfy returned for our entities. Our findings cumulatively indicate that human annotation and automatic learning of multidisciplinary scientific concepts, as well as their semantic disambiguation, are reasonable in a setting as wide-ranging as STEM.
