Title
Semantic Labeling Using a Deep Contextualized Language Model
Authors
Abstract
Generating schema labels automatically for the column values of data tables has many data science applications, such as schema matching and data discovery and linking. For example, automatically extracted tables with missing headers can be filled in with predicted schema labels, which significantly reduces human effort. Furthermore, the predicted labels can reduce the impact of inconsistent names across multiple data tables. Understanding the connection between column values and contextual information is an important yet neglected aspect, as previously proposed methods treat each column independently. In this paper, we propose a context-aware semantic labeling method that uses both the column values and their context. Our method is based on a new setting for semantic labeling, in which we sequentially predict labels for an input table with missing headers. We incorporate both the values and the context of each data column using the pre-trained contextualized language model BERT, which has achieved significant improvements on multiple natural language processing tasks. To our knowledge, we are the first to successfully apply BERT to the semantic labeling task. We evaluate our approach on two real-world datasets from different domains and demonstrate substantial improvements over state-of-the-art feature-based methods on the evaluation metrics.
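The sequential, context-aware setting described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the serialization format (a BERT-style sentence pair of previously predicted headers and the current column's values) and the helper names `serialize_column` and `label_table_sequentially` are assumptions for exposition; the actual fine-tuned BERT classifier is abstracted behind a `classify` callable.

```python
def serialize_column(values, predicted_headers, max_values=8):
    """Serialize one column's cell values plus the table context
    (headers already predicted for preceding columns) into a single
    BERT-style input string. Hypothetical format: the paper's exact
    serialization may differ."""
    value_part = " ".join(str(v) for v in values[:max_values])
    context_part = " ".join(predicted_headers)
    # [CLS] context [SEP] values [SEP] -- standard sentence-pair layout
    return f"[CLS] {context_part} [SEP] {value_part} [SEP]"


def label_table_sequentially(columns, classify):
    """Predict a header for each column left to right, feeding the
    headers predicted so far back in as context. `classify` maps a
    serialized string to a label (e.g. a fine-tuned BERT classifier)."""
    predicted = []
    for col in columns:
        text = serialize_column(col, predicted)
        predicted.append(classify(text))
    return predicted
```

A toy `classify` (e.g. a rule that labels digit-heavy columns "age" and the rest "name") is enough to exercise the loop; in the paper's setting this callable would be the fine-tuned BERT model, so each column's prediction is conditioned on both its values and the labels of the columns before it.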