Paper Title

A Distant Supervision Corpus for Extracting Biomedical Relationships Between Chemicals, Diseases and Genes

Authors

Dongxu Zhang, Sunil Mohan, Michaela Torkar, Andrew McCallum

Abstract

We introduce ChemDisGene, a new dataset for training and evaluating multi-class multi-label document-level biomedical relation extraction models. Our dataset contains 80k biomedical research abstracts labeled with mentions of chemicals, diseases, and genes. Human experts labeled a portion of these abstracts (intended for evaluation) with 18 types of biomedical relationships between these entities; the remainder (intended for training) has been distantly labeled via the CTD database with approximately 78% accuracy. In comparison to similar preexisting datasets, ours is both substantially larger and cleaner; it also includes annotations linking mentions to their entities. We also provide three baseline deep neural network relation extraction models trained and evaluated on our new dataset.
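The abstract does not spell out how the distant labels are produced, so the following is only a minimal sketch of the generic distant-supervision idea, not the authors' pipeline. It assumes each document has already been reduced to a list of linked entity identifiers and that the knowledge base is a list of (subject, relation, object) triples. The function names (`build_kb_index`, `distant_label`), the identifiers (`CHEM:1`, `DIS:1`, `GENE:1`), and the relation names (`REL_A`, `REL_B`, `REL_C`) are all hypothetical placeholders, not CTD or ChemDisGene values.

```python
from collections import defaultdict
from itertools import permutations


def build_kb_index(triples):
    """Map each (subject_id, object_id) pair to the set of relations the KB asserts."""
    index = defaultdict(set)
    for subj, rel, obj in triples:
        index[(subj, obj)].add(rel)
    return index


def distant_label(doc_entity_ids, kb_index):
    """Label every ordered pair of distinct entities in a document that the KB relates.

    Returns {(subj, obj): sorted list of relation labels}. A pair can carry
    several labels at once, which is what makes the task multi-label.
    """
    labels = {}
    unique_ids = sorted(set(doc_entity_ids))  # collapse repeated mentions
    for subj, obj in permutations(unique_ids, 2):
        rels = kb_index.get((subj, obj))  # .get avoids creating empty entries
        if rels:
            labels[(subj, obj)] = sorted(rels)
    return labels


# Toy knowledge base; all identifiers and relation names are placeholders.
kb = build_kb_index([
    ("CHEM:1", "REL_A", "DIS:1"),
    ("CHEM:1", "REL_B", "DIS:1"),
    ("CHEM:1", "REL_C", "GENE:1"),
    ("CHEM:2", "REL_A", "DIS:2"),
])

# Entity identifiers linked from one abstract (GENE:1 is mentioned twice).
doc_entities = ["CHEM:1", "DIS:1", "GENE:1", "GENE:1"]
doc_labels = distant_label(doc_entities, kb)
print(doc_labels)
```

Labels produced this way are noisy, because a relation recorded in the knowledge base is not necessarily stated in a given abstract. That is consistent with the abstract reporting approximate accuracy for the distantly labeled portion and reserving the expert-labeled portion for evaluation.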
