Title
HNet: Graphical Hypergeometric Networks
Authors
Abstract
Motivation: Real-world data often contain measurements with both continuous and discrete values. Although many libraries are available, data sets with mixed data types require intensive pre-processing, and describing the relationships between variables remains a challenge. Data understanding is an important step in the data mining process; however, without any assumptions about the data, the search space is super-exponential in the number of variables. Methods: We propose graphical hypergeometric networks (HNet), a method that uses statistical inference to test associations across variables for significance. The aim is to build a network from only the significant associations in order to shed light on the complex relationships across variables. HNet processes raw unstructured data sets and outputs a network of (partially) directed or undirected edges between the nodes (i.e., variables). To evaluate the accuracy of HNet, we used well-known data sets and, in addition, generated data sets with known ground truth. The performance of HNet is compared to Bayesian structure learning. Results: We demonstrate that HNet shows high accuracy and performance in the detection of node links. For the Alarm data set, HNet achieved an average Matthews correlation coefficient (MCC) score of 0.33 ± 0.0002 (P < 1×10⁻⁶), whereas Bayesian structure learning achieved an average MCC score of 0.52 ± 0.006 (P < 1×10⁻¹¹), and randomly assigning edges resulted in an MCC score of 0.004 ± 0.0003 (P = 0.49). Conclusions: HNet can process raw unstructured data sets, allows analysis of mixed data types, scales easily with the number of variables, and allows detailed examination of the detected associations. Availability: https://erdogant.github.io/hnet/
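The abstract describes testing associations across variables for significance but does not spell out the test itself. The following is a minimal sketch of one way such a test could look, assuming that each pair of category labels from two different variables is tested for over-representation with a one-sided hypergeometric test and that P values are Bonferroni-corrected. Both choices, along with the function names `hypergeom_sf` and `association_edges` and the toy data, are illustrative assumptions and not the HNet implementation or its API.

```python
from itertools import product
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, K successes, n draws)."""
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def association_edges(rows, alpha=0.05):
    """Test every pair of category labels from two different columns.

    rows: list of dicts mapping column name -> category label.
    Returns (col_a, val_a, col_b, val_b, adjusted_p) for pairs whose
    Bonferroni-adjusted P value is at most alpha (assumed correction).
    """
    N = len(rows)
    columns = sorted(rows[0])
    tests = []
    for i, col_a in enumerate(columns):
        for col_b in columns[i + 1:]:
            vals_a = sorted({r[col_a] for r in rows})
            vals_b = sorted({r[col_b] for r in rows})
            for va, vb in product(vals_a, vals_b):
                K = sum(r[col_a] == va for r in rows)  # rows with label va
                n = sum(r[col_b] == vb for r in rows)  # rows with label vb
                k = sum(r[col_a] == va and r[col_b] == vb for r in rows)  # overlap
                tests.append((col_a, va, col_b, vb, hypergeom_sf(k, N, K, n)))
    m = len(tests)  # number of tests, used for the Bonferroni adjustment
    return [(a, va, b, vb, min(1.0, p * m))
            for a, va, b, vb, p in tests if p * m <= alpha]


# Toy data (hypothetical): "smoker" and "cough" co-occur perfectly,
# while "city" alternates independently of both.
rows = [
    {"smoker": "yes" if i < 10 else "no",
     "cough": "yes" if i < 10 else "no",
     "city": "A" if i % 2 == 0 else "B"}
    for i in range(20)
]
edges = association_edges(rows)
for edge in edges:
    print(edge)
```

Each returned tuple can be read as an undirected edge between two labelled nodes. Real data with continuous columns would first need to be discretized, which this sketch leaves out.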