Transformer-based Subject Entity Detection in Wikipedia Listings

Paper title

Transformer-based Subject Entity Detection in Wikipedia Listings

Authors

Nicolas Heist, Heiko Paulheim

Abstract

In tasks like question answering or text summarisation, it is essential to have background knowledge about the relevant entities. The information about entities - in particular, about long-tail or emerging entities - in publicly available knowledge graphs like DBpedia or CaLiGraph is far from complete. In this paper, we present an approach that exploits the semi-structured nature of listings (like enumerations and tables) to identify the main entities of the listing items (i.e., of entries and rows). These entities, which we call subject entities, can be used to increase the coverage of knowledge graphs. Our approach uses a transformer network to identify subject entities at the token level and surpasses an existing approach in terms of performance while being bound by fewer limitations. Due to a flexible input format, it is applicable to any kind of listing and is, unlike prior work, not dependent on entity boundaries as input. We demonstrate our approach by applying it to the complete Wikipedia corpus and extracting 40 million mentions of subject entities with an estimated precision of 71% and recall of 77%. The results are incorporated in the most recent version of CaLiGraph.
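
To make "identifying subject entities at the token level" concrete, the sketch below shows one common way per-token predictions can be grouped into entity mentions. It is only an illustration: the abstract does not state the labelling scheme the authors use, so the B/I/O tags, the function name `decode_mentions`, and the example listing item are all assumptions.

```python
from typing import List, Tuple

VALID_TAGS = {"B", "I", "O"}  # assumed scheme: Begin / Inside / Outside


def decode_mentions(tokens: List[str], tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group token-level tags into (start, end, text) mentions; end is exclusive."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    mentions: List[Tuple[int, int, str]] = []
    start = None  # index of the first token of the currently open mention

    for i, tag in enumerate(tags):
        if tag not in VALID_TAGS:
            raise ValueError(f"unknown tag: {tag!r}")
        if tag == "B" or (tag == "I" and start is None):
            # A new mention begins; close the previous one if it is still open.
            if start is not None:
                mentions.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif tag == "O" and start is not None:
            # The open mention ends just before this token.
            mentions.append((start, i, " ".join(tokens[start:i])))
            start = None
        # tag == "I" with an open mention simply extends it.

    if start is not None:
        mentions.append((start, len(tokens), " ".join(tokens[start:])))
    return mentions


# Hypothetical listing item: one entry of an enumeration of albums.
tokens = ["Abbey", "Road", "(", "1969", ")"]
tags = ["B", "I", "O", "O", "O"]
print(decode_mentions(tokens, tags))  # [(0, 2, 'Abbey Road')]
```

Because the mention boundaries are derived from the predicted tags, a decoder of this kind needs no entity boundaries as input, which is the property the abstract highlights.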
