Paper title

Building a PubMed knowledge graph

Authors

Xu, Jian; Kim, Sunkyu; Song, Min; Jeong, Minbyul; Kim, Donghyeon; Kang, Jaewoo; Rousseau, Justin F.; Li, Xin; Xu, Weijia; Torvik, Vetle I.; Bu, Yi; Chen, Chongyan; Ebeid, Islam Akef; Li, Daifeng; Ding, Ying

Abstract

PubMed is an essential resource for the medical domain, but useful concepts are either difficult to extract or ambiguous, which has significantly hindered knowledge discovery. To address this issue, we constructed a PubMed knowledge graph (PKG) by extracting bio-entities from 29 million PubMed abstracts, disambiguating author names, integrating funding data through the National Institutes of Health (NIH) ExPORTER, collecting affiliation history and educational background of authors from ORCID, and identifying fine-grained affiliation data from MapAffil. By integrating credible multi-source data, we could create connections among the bio-entities, authors, articles, affiliations, and funding. Data validation revealed that the BioBERT deep learning method of bio-entity extraction significantly outperformed the state-of-the-art models based on the F1 score (by 0.51%), with the author name disambiguation (AND) achieving an F1 score of 98.09%. PKG can trigger broader innovations, not only enabling us to measure scholarly impact, knowledge usage, and knowledge transfer, but also assisting us in profiling authors and organizations based on their connections with bio-entities. The PKG is freely available on Figshare (https://figshare.com/s/6327a55355fc2c99f3a2, a simplified version that excludes PubMed raw data) and on the TACC website (http://er.tacc.utexas.edu/datasets/ped, full version).
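To make the idea of connecting bio-entities, authors, articles, affiliations, and funding more concrete, the following is a minimal sketch in Python. It is not the PKG schema or the authors' implementation: the records, the identifier prefixes (`article:`, `author:`, `gene:`, `disease:`, `grant:`, `org:`), and the helper names `build_graph` and `profile_author` are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy records; identifiers and field names are illustrative only
# and do not reflect the actual PKG tables.
ARTICLES = [
    {
        "pmid": "A1",
        "authors": ["author:1"],
        "entities": ["gene:BRCA1", "disease:breast cancer"],
        "grants": ["grant:X"],
    },
    {
        "pmid": "A2",
        "authors": ["author:1", "author:2"],
        "entities": ["gene:BRCA1"],
        "grants": [],
    },
]
AFFILIATIONS = {"author:1": ["org:U1"], "author:2": ["org:U2"]}


def build_graph(articles, affiliations):
    """Return an undirected adjacency map linking all node types."""
    graph = defaultdict(set)

    def link(a, b):
        graph[a].add(b)
        graph[b].add(a)

    for art in articles:
        node = "article:" + art["pmid"]
        # Each article is connected to its authors, bio-entities, and grants.
        for key in ("authors", "entities", "grants"):
            for other in art[key]:
                link(node, other)
    # Authors are connected to their affiliations.
    for author, orgs in affiliations.items():
        for org in orgs:
            link(author, org)
    return graph


def profile_author(graph, author):
    """Count the bio-entities reachable from an author via their articles."""
    counts = defaultdict(int)
    for neighbor in graph[author]:
        if not neighbor.startswith("article:"):
            continue  # skip affiliations and other non-article neighbors
        for entity in graph[neighbor]:
            if entity.startswith(("gene:", "disease:")):
                counts[entity] += 1
    return dict(counts)


graph = build_graph(ARTICLES, AFFILIATIONS)
```

Calling `profile_author(graph, "author:1")` walks from the author to their articles and on to the bio-entities mentioned in those articles, which is one simple way to read "profiling authors based on their connections with bio-entities".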
