Paper Title
PTab: Using the Pre-trained Language Model for Modeling Tabular Data
Paper Authors
Paper Abstract
Tabular data is the foundation of the information age and has been extensively studied. Recent studies show that neural-based models are effective at learning contextual representations of tabular data. Learning an effective contextual representation requires meaningful features and a large amount of data. However, current methods often fail to learn a proper contextual representation from features that lack semantic information. In addition, it is intractable to enlarge the training set by mixing tabular datasets, due to the differences between datasets. To address these problems, we propose PTab, a novel framework that uses the Pre-trained language model to model Tabular data. PTab learns a contextual representation of tabular data through a three-stage process: Modality Transformation (MT), Masked-Language Fine-tuning (MF), and Classification Fine-tuning (CF). We initialize our model with a pre-trained model (PTM) that contains semantic information learned from large-scale language data, so contextual representations can be learned effectively during the fine-tuning stages. In addition, we can naturally mix the textualized tabular data to enlarge the training set and further improve representation learning. We evaluate PTab on eight popular tabular classification datasets. Experimental results show that our method achieves a better average AUC score in supervised settings than state-of-the-art baselines (e.g., XGBoost) and outperforms counterpart methods under semi-supervised settings. We also present visualization results showing that PTab has good instance-based interpretability.
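To make the three-stage pipeline concrete, below is a minimal sketch assuming a BERT backbone from the HuggingFace `transformers` library. The textualization template ("feature is value"), the helper name `textualize_row`, the example row, and the `bert-base-uncased` checkpoint are all illustrative assumptions, not the paper's released code or exact format.

```python
# Hedged sketch of PTab's three stages: MT -> MF -> CF.
import torch
from transformers import (
    BertTokenizerFast,
    BertForMaskedLM,
    BertForSequenceClassification,
    DataCollatorForLanguageModeling,
)

def textualize_row(row: dict) -> str:
    # Stage 1, Modality Transformation (MT): serialize one tabular row into a
    # sentence so a language model can consume it. The "feature is value"
    # template is one plausible format, assumed for illustration.
    return " ".join(f"{name} is {value}." for name, value in row.items())

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
row = {"age": 42, "income": 55000, "housing": "own"}  # hypothetical example row
text = textualize_row(row)  # "age is 42. income is 55000. housing is own."

# Stage 2, Masked-Language Fine-tuning (MF): adapt the pre-trained model to
# textualized rows with a masked-LM objective. One gradient step is shown; a
# real run would loop over all (possibly mixed) textualized datasets.
mlm_model = BertForMaskedLM.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
batch = collator([tokenizer(text)])  # randomly masks tokens and builds labels
loss = mlm_model(**batch).loss
loss.backward()  # optimizer step omitted for brevity

# Stage 3, Classification Fine-tuning (CF): attach a classification head and
# predict. (Here a fresh checkpoint is loaded for brevity; PTab would reuse
# the MF-tuned encoder weights.)
clf_model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = clf_model(**inputs).logits  # per-class scores for this row
```

Because every dataset is reduced to plain text in Stage 1, rows from different tabular datasets can simply be concatenated into one MF training corpus, which is what makes the mixed-dataset enlargement described in the abstract natural.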