Paper Title

Deep Learning to Jointly Schema Match, Impute, and Transform Databases

Authors

Sandhya Tripathi, Bradley A. Fritz, Mohamed Abdelhack, Michael S. Avidan, Yixin Chen, Christopher R. King

Abstract

An applied problem facing all areas of data science is harmonizing data sources. Joining data from multiple origins with unmapped and only partially overlapping features is a prerequisite to developing and testing robust, generalizable algorithms, especially in health care. We approach this issue in the common but difficult case of numeric features such as nearly Gaussian and binary features, where unit changes and variable shift make simple matching of univariate summaries unsuccessful. We develop two novel procedures to address this problem. First, we demonstrate multiple methods of "fingerprinting" a feature based on its associations to other features. In the setting of even modest prior information, this allows most shared features to be accurately identified. Second, we demonstrate a deep learning algorithm for translation between databases. Unlike prior approaches, our algorithm takes advantage of discovered mappings while identifying surrogates for unshared features and learning transformations. In synthetic and real-world experiments using two electronic health record databases, our algorithms outperform existing baselines for matching variable sets, while jointly learning to impute unshared or transformed variables.
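
To make the "fingerprinting" idea concrete, here is a minimal sketch. It is not the authors' implementation. It assumes that a few features are already known to correspond across the two databases (the "modest prior information"), fingerprints every other feature by its Pearson correlations with those anchors, and pairs features greedily by fingerprint distance. The data generator, the column names (`height_cm`, `sbp_mmhg`, `col_1`, `col_2`), and the greedy matching rule are all hypothetical choices made for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def fingerprints(columns, anchor_names, other_names):
    """Fingerprint each non-anchor column by its correlations with the anchors."""
    return {
        name: [pearson(columns[name], columns[a]) for a in anchor_names]
        for name in other_names
    }


def match_features(fp_a, fp_b):
    """Greedily pair features whose fingerprints are closest (Euclidean distance)."""
    candidates = sorted(
        (math.dist(va, vb), a, b)
        for a, va in fp_a.items()
        for b, vb in fp_b.items()
    )
    matched, used_a, used_b = {}, set(), set()
    for _, a, b in candidates:
        if a not in used_a and b not in used_b:
            matched[a] = b
            used_a.add(a)
            used_b.add(b)
    return matched


def make_db(rng, n, names, scale):
    """Hypothetical synthetic patients; `names` and `scale` set column labels and units."""
    rows = {"age": [], "weight": [], "height": [], "sbp": []}
    for _ in range(n):
        age = rng.gauss(50, 15)
        weight = rng.gauss(80, 15)
        rows["age"].append(age)
        rows["weight"].append(weight)
        rows["height"].append(100 + 0.8 * weight + rng.gauss(0, 5))
        rows["sbp"].append(90 + 0.6 * age + rng.gauss(0, 6))
    return {names[k]: [scale[k] * v for v in vals] for k, vals in rows.items()}


rng = random.Random(0)
# Database A: metric units and descriptive names.
db_a = make_db(
    rng, 500,
    {"age": "age", "weight": "weight_kg", "height": "height_cm", "sbp": "sbp_mmhg"},
    {"age": 1.0, "weight": 1.0, "height": 1.0, "sbp": 1.0},
)
# Database B: different patients, different units, uninformative names.
db_b = make_db(
    rng, 500,
    {"age": "age_years", "weight": "weight_lb", "height": "col_2", "sbp": "col_1"},
    {"age": 1.0, "weight": 2.2046, "height": 1 / 2.54, "sbp": 0.1333},
)

# Anchors must be listed in the same (already known) order for both databases.
fp_a = fingerprints(db_a, ["age", "weight_kg"], ["height_cm", "sbp_mmhg"])
fp_b = fingerprints(db_b, ["age_years", "weight_lb"], ["col_1", "col_2"])
mapping = match_features(fp_a, fp_b)
print(mapping)
```

Because Pearson correlation is unchanged by positive rescaling and offsets, the fingerprints survive the unit changes that defeat matching on univariate summaries such as means or variances. With many candidate features, an optimal assignment (for example `scipy.optimize.linear_sum_assignment`) could replace the greedy pairing.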
