知识图上的汇总查询：使用语义吸引抽样快速近似

论文标题

知识图上的汇总查询：使用语义吸引抽样快速近似

Aggregate Queries on Knowledge Graphs: Fast Approximation with Semantic-aware Sampling

论文作者

Wang, Yuxiang, Khan, Arijit, Xu, Xiaoliang, Jin, Jiahui, Hong, Qifan, Fu, Tao

论文摘要

知识图（kg）以模式富足的方式管理大规模和现实世界的事实。总查询是公里的基本查询，例如，“德国生产的汽车的平均价格是多少？”。尽管它很重要，但在文献中回答了KGS的总疑问几乎没有受到关注。可以根据Factoid查询，例如“找到德国生产的所有汽车”，通过在Factoid查询的答案上应用额外的骨料操作来支持聚集查询。但是，这种简单的方法具有挑战性，因为Factoid查询处理的准确性和效率都会严重影响总查询的性能。在本文中，我们提出了一个“抽样估计”模型，以回答KGS上的汇总查询，这是提供有效准确性保证的第一项提供近似汇总结果的工作，而无需依赖Factoid查询。具体而言，我们首先提出语义意识采样，以基于知识图嵌入的随机步行来收集高质量的随机样本。然后，我们提出了对AVG的计数，总和和一致估计器的无偏估计量，以根据随机样本计算近似聚合结果，并以置信区间的形式准确保证。我们扩展了支持精度的迭代提高的方法，并使用过滤器，组和不同的图形形状，例如链，循环，恒星，花朵，更复杂的查询。对现实世界KGS的广泛实验证明了我们方法的有效性和效率。

A knowledge graph (KG) manages large-scale and real-world facts as a big graph in a schema-flexible manner. Aggregate query is a fundamental query over KGs, e.g., "what is the average price of cars produced in Germany?". Despite its importance, answering aggregate queries on KGs has received little attention in the literature. Aggregate queries can be supported based on factoid queries, e.g., "find all cars produced in Germany", by applying an additional aggregate operation on factoid queries' answers. However, this straightforward method is challenging because both the accuracy and efficiency of factoid query processing will seriously impact the performance of aggregate queries. In this paper, we propose a "sampling-estimation" model to answer aggregate queries over KGs, which is the first work to provide an approximate aggregate result with an effective accuracy guarantee, and without relying on factoid queries. Specifically, we first present a semantic-aware sampling to collect a high-quality random sample through a random walk based on knowledge graph embedding. Then, we propose unbiased estimators for COUNT, SUM, and a consistent estimator for AVG to compute the approximate aggregate results based on the random sample, with an accuracy guarantee in the form of confidence interval. We extend our approach to support iterative improvement of accuracy, and more complex queries with filter, GROUP-BY, and different graph shapes, e.g., chain, cycle, star, flower. Extensive experiments over real-world KGs demonstrate the effectiveness and efficiency of our approach.

下载PDF全文

下载文献需遵守相关版权规定

论文标题