Paper Title
RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation
Paper Authors
Paper Abstract
Can AI help automate human-easy but computer-hard data preparation tasks that burden data scientists, practitioners, and crowd workers? We answer this question by presenting RPT, a denoising auto-encoder for tuple-to-X models (X could be tuple, token, label, JSON, and so on). RPT is pre-trained for a tuple-to-tuple model by corrupting the input tuple and then learning a model to reconstruct the original tuple. It adopts a Transformer-based neural translation architecture that consists of a bidirectional encoder (similar to BERT) and a left-to-right autoregressive decoder (similar to GPT), leading to a generalization of both BERT and GPT. The pre-trained RPT can already support several common data preparation tasks such as data cleaning, auto-completion and schema matching. Better still, RPT can be fine-tuned on a wide range of data preparation tasks, such as value normalization, data transformation, data annotation, etc. To complement RPT, we also discuss several appealing techniques such as collaborative training and few-shot learning for entity resolution, and few-shot learning and NLP question-answering for information extraction. In addition, we identify a series of research opportunities to advance the field of data preparation.
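To make the pre-training objective described above concrete, the following minimal sketch (Python, using the Hugging Face transformers library) corrupts a serialized tuple and trains a BART-style encoder-decoder to reconstruct it. BART matches the architecture the abstract names (a bidirectional BERT-like encoder plus a left-to-right GPT-like autoregressive decoder), so it stands in for RPT here; the "attribute: value" serialization and the single-value masking are illustrative assumptions, not the paper's exact pre-training recipe.

# Denoising auto-encoding over a serialized tuple: corrupt, then reconstruct.
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Serialize a relational tuple as "attribute: value" pairs (an assumed scheme).
original = "name: Michael Jordan ; city: Chicago ; team: Bulls"

# Corrupt the input tuple by masking one attribute value.
corrupted = "name: Michael Jordan ; city: <mask> ; team: Bulls"

# Pre-training step: the model learns to reconstruct the original tuple.
inputs = tokenizer(corrupted, return_tensors="pt")
labels = tokenizer(original, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss  # reconstruction (denoising) loss
loss.backward()  # one gradient step toward recovering the clean tuple

# Inference: the decoder autoregressively fills in the missing value,
# which is the mechanism behind the auto-completion and data-cleaning uses.
generated = model.generate(inputs.input_ids, max_length=40)
print(tokenizer.decode(generated[0], skip_special_tokens=True))

The same corrupt-and-reconstruct loop generalizes to the tuple-to-X settings mentioned in the abstract: changing what the decoder is trained to emit (a repaired tuple, a label, a JSON object) changes the downstream task without changing the architecture.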