Paper Title
WDV: A Broad Data Verbalisation Dataset Built from Wikidata
Authors
Abstract
Data verbalisation is a task of great importance in the current field of natural language processing, as there is great benefit in the transformation of our abundant structured and semi-structured data into human-readable formats. Verbalising Knowledge Graph (KG) data focuses on converting interconnected triple-based claims, formed of subject, predicate, and object, into text. Although KG verbalisation datasets exist for some KGs, there are still gaps in their fitness for use in many scenarios. This is especially true for Wikidata, where available datasets either loosely couple claim sets with textual information or heavily focus on predicates around biographies, cities, and countries. To address these gaps, we propose WDV, a large KG claim verbalisation dataset built from Wikidata, with a tight coupling between triples and text, covering a wide variety of entities and predicates. We also evaluate the quality of our verbalisations through a reusable workflow for measuring human-centred fluency and adequacy scores. Our data and code are openly available in the hopes of furthering research towards KG verbalisation.