Paper title

TabDDPM: Modelling Tabular Data with Diffusion Models

Authors

Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, Artem Babenko

Abstract

Denoising diffusion probabilistic models are currently becoming the leading paradigm of generative modeling for many important data modalities. Being the most prevalent in the computer vision community, diffusion models have also recently gained some attention in other domains, including speech, NLP, and graph-like data. In this work, we investigate if the framework of diffusion models can be advantageous for general tabular problems, where datapoints are typically represented by vectors of heterogeneous features. The inherent heterogeneity of tabular data makes it quite challenging for accurate modeling, since the individual features can be of completely different nature, i.e., some of them can be continuous and some of them can be discrete. To address such data types, we introduce TabDDPM -- a diffusion model that can be universally applied to any tabular dataset and handles any type of feature. We extensively evaluate TabDDPM on a wide set of benchmarks and demonstrate its superiority over existing GAN/VAE alternatives, which is consistent with the advantage of diffusion models in other fields. Additionally, we show that TabDDPM is eligible for privacy-oriented setups, where the original datapoints cannot be publicly shared.
