论文标题
在RDF图中紧凑频繁的恒星图案
Compacting Frequent Star Patterns in RDF Graphs
论文作者
论文摘要
知识图已成为使用图数据模型(例如资源描述框架(RDF))代表实体及其属性的流行形式主义。 RDF图包括使用属性注释的标记边缘连接到对象或其他实体的相同类型的实体。 RDF图通常包含在特定属性组中共享相同对象的实体,即它们匹配由这些属性和对象组成的星形模式。如果这些恒星模式中的这些实体或属性的数量很大,则RDF图和查询处理的大小受到负面影响;我们将这些恒星图案称为频繁的恒星图案。我们解决了识别RDF图中频繁的星模式并设计分解的RDF图的概念的问题,该图表示RDF图的紧凑表示,其中将频繁的星模式的数量最小化。我们还开发了计算方法,以识别频繁的恒星模式并生成分解的RDF图,其中紧凑的RDF分子替换了频繁的恒星模式。频繁恒星图案的紧凑型RDF分子表示实例化相应恒星图案的RDF子图。添加了一个替代实体,而不是让所有与原始频繁的星形模式匹配的实体,而是与频繁的星形图案的属性相关。它链接到最初与频繁星形图案匹配的实体。我们在几个RDF图基准上评估了分解技术的性能,并与在GSPAN顶部建立的基线进行比较,GSPAN是一种最先进的算法,可检测频繁的模式。结果证据证明了拟议方法的效率,并表明我们的技术能够减少至少三个数量级的基线方法的执行时间,从而将RDF图的大小降低了66.56%。
Knowledge graphs have become a popular formalism for representing entities and their properties using a graph data model, e.g., the Resource Description Framework (RDF). An RDF graph comprises entities of the same type connected to objects or other entities using labeled edges annotated with properties. RDF graphs usually contain entities that share the same objects in a certain group of properties, i.e., they match star patterns composed of these properties and objects. In case the number of these entities or properties in these star patterns is large, the size of the RDF graph and query processing are negatively impacted; we refer these star patterns as frequent star patterns. We address the problem of identifying frequent star patterns in RDF graphs and devise the concept of factorized RDF graphs, which denote compact representations of RDF graphs where the number of frequent star patterns is minimized. We also develop computational methods to identify frequent star patterns and generate a factorized RDF graph, where compact RDF molecules replace frequent star patterns. A compact RDF molecule of a frequent star pattern denotes an RDF subgraph that instantiates the corresponding star pattern. Instead of having all the entities matching the original frequent star pattern, a surrogate entity is added and related to the properties of the frequent star pattern; it is linked to the entities that originally match the frequent star pattern. We evaluate the performance of our factorization techniques on several RDF graph benchmarks and compare with a baseline built on top of gSpan, a state-of-the-art algorithm to detect frequent patterns. The outcomes evidence the efficiency of proposed approach and show that our techniques are able to reduce execution time of the baseline approach in at least three orders of magnitude reducing the RDF graph size by up to 66.56%.