DATGAN: Integrating expert knowledge into deep learning for synthetic tabular data

Paper title

DATGAN: Integrating expert knowledge into deep learning for synthetic tabular data

Authors

Gael Lederrey, Tim Hillel, Michel Bierlaire

Abstract

Synthetic data can be used in various applications, such as correcting biased datasets or replacing scarce original data for simulation purposes. Generative Adversarial Networks (GANs) are considered the state of the art for developing generative models. However, these deep learning models are data-driven, which makes the generation process difficult to control. This can lead to the following issues: a lack of representativity in the generated data, the introduction of bias, and the possibility of overfitting the noise in the sample. This article presents the Directed Acyclic Tabular GAN (DATGAN), which addresses these limitations by integrating expert knowledge into deep learning models for synthetic tabular data generation. The approach allows the interactions between variables to be specified explicitly using a Directed Acyclic Graph (DAG). The DAG is then converted into a network of Long Short-Term Memory (LSTM) cells that have been modified to accept multiple inputs. Multiple DATGAN versions are systematically tested on multiple assessment metrics. We show that the best versions of the DATGAN outperform state-of-the-art generative models on multiple case studies. Finally, we show how the DAG can be used to create hypothetical synthetic datasets.
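As a rough illustration of what specifying variable interactions with a DAG could look like, the sketch below encodes a small graph and derives an order in which variables could be generated so that every variable comes after the variables that influence it. This is not the DATGAN implementation or its API: the variable names, the `dag` dictionary, and the `generation_order` helper are hypothetical, and the LSTM cells themselves are not modeled. It only uses Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical expert-specified DAG (illustrative variable names only):
# each variable maps to the set of variables that directly influence it.
dag = {
    "age": set(),
    "income": {"age"},
    "driving_license": {"age"},
    "travel_mode": {"income", "driving_license"},
}


def generation_order(parents):
    """Return the variables ordered so that every parent precedes its children.

    Raises ValueError if the graph contains a cycle, since it would then
    not be a valid DAG.
    """
    try:
        return list(TopologicalSorter(parents).static_order())
    except CycleError as err:
        raise ValueError(f"graph is not acyclic: {err.args[1]}") from err


order = generation_order(dag)

# In a DAG-structured generator, the unit producing each variable would
# receive inputs from the units of its parents; a variable with several
# parents (here "travel_mode") is the case that needs multiple inputs.
for var in order:
    inputs = sorted(dag[var]) or ["<noise only>"]
    print(f"{var} <- {inputs}")
```

Here `travel_mode` has two parents, which is the situation the abstract refers to when it says the LSTM cells are modified to accept multiple inputs.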

Paper title