Paper title
EGG-GAE: scalable graph neural networks for tabular data imputation
Paper authors
Paper abstract
Missing data imputation (MDI) is crucial when dealing with tabular datasets across various domains. Autoencoders can be trained to reconstruct missing values, and graph autoencoders (GAE) can additionally consider similar patterns in the dataset when imputing new values for a given instance. However, previously proposed GAEs suffer from scalability issues, requiring the user to define a similarity metric among patterns to build the graph connectivity beforehand. In this paper, we leverage recent progress in latent graph imputation to propose a novel EdGe Generation Graph AutoEncoder (EGG-GAE) for missing data imputation that overcomes these two drawbacks. EGG-GAE works on randomly sampled mini-batches of the input data (hence scaling to larger datasets), and it automatically infers the best connectivity across the mini-batch for each architecture layer. We also experiment with several extensions, including an ensemble strategy for inference and the inclusion of what we call prototype nodes, obtaining significant improvements, both in terms of imputation error and final downstream accuracy, across multiple benchmarks and baselines.
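As a rough illustration of the idea of imputing from a graph built over a randomly sampled mini-batch, the following is a minimal sketch and not the EGG-GAE model itself. It makes two loud simplifications: the learned, per-layer edge generation is replaced by a fixed k-nearest-neighbour graph, and the graph autoencoder is replaced by plain neighbour averaging. The function names `knn_graph` and `impute_batch`, the batch size, and the value of `k` are hypothetical choices made only for this example.

```python
import numpy as np


def knn_graph(z, k):
    """Return, for each row of z, the indices of its k nearest rows (self excluded)."""
    dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a node is never its own neighbour
    return np.argsort(dist, axis=1)[:, :k]


def impute_batch(x, mask, k=3):
    """Impute missing entries of one mini-batch.

    x    : (n, d) array; entries where mask is False are ignored (may be NaN).
    mask : (n, d) boolean array, True where the value is observed.
    """
    # Step 1: crude initial fill with the per-column mean of observed values.
    col_sum = np.where(mask, x, 0.0).sum(axis=0)
    col_cnt = np.maximum(mask.sum(axis=0), 1)  # avoid division by zero
    filled = np.where(mask, x, col_sum / col_cnt)

    # Step 2: build the graph from the batch itself (fixed kNN here,
    # standing in for the learned connectivity described in the abstract).
    neighbours = knn_graph(filled, k)

    # Step 3: replace each missing entry with the mean over its neighbours.
    neighbour_mean = filled[neighbours].mean(axis=1)
    return np.where(mask, x, neighbour_mean)


rng = np.random.default_rng(0)
data = rng.normal(size=(100, 4))                        # toy tabular dataset
batch_idx = rng.choice(len(data), size=8, replace=False)  # random mini-batch
x = data[batch_idx]
mask = rng.random(x.shape) > 0.3                        # True = observed
x_missing = np.where(mask, x, np.nan)

out = impute_batch(x_missing, mask, k=3)
```

Because the graph is rebuilt from whichever rows land in the batch, the cost of the neighbour search depends on the batch size rather than on the full dataset size, which is the scaling property the abstract points to.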