Distributed non-negative RESCAL with Automatic Model Selection for Exascale Data

Paper Title

Distributed non-negative RESCAL with Automatic Model Selection for Exascale Data

Authors

Manish Bhattarai, Namita Kharat, Erik Skau, Benjamin Nebgen, Hristo Djidjev, Sanjay Rajopadhye, James P. Smith, Boian Alexandrov

Abstract

With the boom in the development of computer hardware and software, social media, IoT platforms, and communications, the volume of data produced around the world has grown exponentially. Among these data, relational datasets are growing in popularity because they provide unique insights into the evolution of communities and their interactions. Relational datasets are naturally non-negative, sparse, and extra-large. Relational data usually consist of triples (subject, relation, object) and are represented as graphs/multigraphs, called knowledge graphs, which need to be embedded into a low-dimensional dense vector space. Among the various embedding models, RESCAL allows learning of relational data to extract the posterior distributions over the latent variables and to make predictions of missing relations. However, RESCAL is computationally demanding and requires a fast, distributed implementation to analyze extra-large real-world datasets. Here we introduce a distributed non-negative RESCAL algorithm for heterogeneous CPU/GPU architectures with automatic selection of the number of latent communities (model selection), called pyDRESCALk. We demonstrate the correctness of pyDRESCALk on real-world and large synthetic tensors, and its efficacy, showing near-linear scaling that agrees with the theoretical complexities. Finally, pyDRESCALk determines the number of latent communities in an 11-terabyte dense and a 9-exabyte sparse synthetic tensor.
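
To make the factorization behind the abstract concrete, here is a minimal single-node sketch of non-negative RESCAL, which approximates each relation slice as X_k ≈ A R_k Aᵀ with non-negative A and R_k. It assumes the standard multiplicative-update rules for the least-squares objective. It is not taken from pyDRESCALk, and it omits the distribution, GPU support, and automatic model selection described above. The names `nn_rescal` and `rel_error` are hypothetical.

```python
import numpy as np

def nn_rescal(X, rank, n_iter=200, eps=1e-9, seed=0):
    """Illustrative non-negative RESCAL via multiplicative updates.

    X    : non-negative array of shape (m, n, n), one n x n slice per relation
    rank : assumed number of latent communities (fixed here, not selected)
    Returns A of shape (n, rank) and R of shape (m, rank, rank).
    """
    rng = np.random.default_rng(seed)
    m, n, _ = X.shape
    A = rng.random((n, rank))        # shared entity embedding
    R = rng.random((m, rank, rank))  # one interaction matrix per relation
    for _ in range(n_iter):
        # Update A with all R_k held fixed.
        AtA = A.T @ A
        num = np.zeros_like(A)
        den = np.zeros((rank, rank))
        for k in range(m):
            num += X[k] @ A @ R[k].T + X[k].T @ A @ R[k]
            den += R[k] @ AtA @ R[k].T + R[k].T @ AtA @ R[k]
        A *= num / (A @ den + eps)   # eps guards against division by zero
        # Update each R_k with A held fixed.
        AtA = A.T @ A
        for k in range(m):
            R[k] *= (A.T @ X[k] @ A) / (AtA @ R[k] @ AtA + eps)
    return A, R

def rel_error(X, A, R):
    """Relative Frobenius error between X and its reconstruction A R_k A^T."""
    X_hat = np.einsum('ir,krs,js->kij', A, R, A)
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```

Because the updates only multiply by non-negative ratios, A and R stay non-negative when initialized that way. A model-selection scheme would wrap a routine like this in a loop over candidate ranks, but the criterion pyDRESCALk uses is not specified in the abstract and is not reproduced here.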
