Paper title

Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia

Authors

Harshdeep Singh, Robert West, Giovanni Colavizza

Abstract

Wikipedia's contents are based on reliable and published sources. To date, relatively little is known about what sources Wikipedia relies on, in part because extracting citations and identifying cited sources is challenging. To close this gap, we release Wikipedia Citations, a comprehensive dataset of citations extracted from Wikipedia. A total of 29.3M citations were extracted from 6.1M English Wikipedia articles as of May 2020, and classified as being to books, journal articles, or Web content. We were thus able to extract 4.0M citations to scholarly publications with known identifiers (including DOI, PMC, PMID, and ISBN) and further equip an extra 261K citations with DOIs from Crossref. As a result, we find that 6.7% of Wikipedia articles cite at least one journal article with an associated DOI, and that Wikipedia cites just 2% of all articles with a DOI currently indexed in the Web of Science. We release our code to allow the community to extend our work and update the dataset in the future.
