Paper Title
Can Foundation Models Wrangle Your Data?
Paper Authors
Paper Abstract
Foundation Models (FMs) are models trained on large corpora of data that, at very large scale, can generalize to new tasks without any task-specific finetuning. As these models continue to grow in size, innovations continue to push the boundaries of what these models can do on language and image tasks. This paper aims to understand an underexplored area of FMs: classical data tasks like cleaning and integration. As a proof-of-concept, we cast five data cleaning and integration tasks as prompting tasks and evaluate the performance of FMs on these tasks. We find that large FMs generalize and achieve SoTA performance on data cleaning and integration tasks, even though they are not trained for these data tasks. We identify specific research challenges and opportunities that these models present, including challenges with private and domain-specific data, and opportunities to make data management systems more accessible to non-experts. We make our code and experiments publicly available at: https://github.com/HazyResearch/fm_data_tasks.
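To make the idea of "casting a data task as a prompting task" concrete, here is a minimal hypothetical sketch (not the authors' exact serialization) of how an entity-matching pair of records might be flattened into a natural-language prompt for an FM to complete with "Yes" or "No":

```python
def serialize(record: dict) -> str:
    # Flatten a structured record into "attribute: value" text,
    # the kind of linearization prompting-based approaches rely on.
    return "; ".join(f"{k}: {v}" for k, v in record.items())

def entity_match_prompt(a: dict, b: dict) -> str:
    # Cast entity matching as a yes/no completion task.
    return (
        f"Product A is {serialize(a)}. "
        f"Product B is {serialize(b)}. "
        "Are Product A and Product B the same? Yes or No?"
    )

# Hypothetical example records; attribute names are illustrative only.
a = {"title": "Apple iPhone 13", "price": "799"}
b = {"title": "iPhone 13 128GB", "price": "799.00"}
prompt = entity_match_prompt(a, b)
```

The resulting `prompt` string would then be sent to an FM, whose completion is parsed back into a match/no-match label; the model and decoding details are left out here.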