Paper Title

Google Dataset Search by the Numbers

Authors

Omar Benjelloun, Shiyu Chen, Natasha Noy

Abstract

Scientists, governments, and companies increasingly publish datasets on the Web. Google's Dataset Search extracts dataset metadata -- expressed using schema.org and similar vocabularies -- from Web pages in order to make datasets discoverable. Since we started the work on Dataset Search in 2016, the number of datasets described in schema.org has grown from about 500K to almost 30M. Thus, this corpus has become a valuable snapshot of data on the Web. To the best of our knowledge, this corpus is the largest and most diverse of its kind. We analyze this corpus and discuss where the datasets originate from, what topics they cover, which form they take, and what people searching for datasets are interested in. Based on this analysis, we identify gaps and possible future work to help make data more discoverable.
