Paper Title


Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles

Authors

Ostendorff, Malte, Ruas, Terry, Schubotz, Moritz, Rehm, Georg, Gipp, Bela

Abstract


Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what is the relationship that makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best performing system with an F1-score of 0.93, which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivates the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries such that one could find documents that are similar in one aspect but dissimilar in another.
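The pairwise setup described above — encoding each document of a pair and combining the vectors before a multi-class classifier — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding dimension, the number of relation classes, the random stand-in embeddings, and the use of scikit-learn's `LogisticRegression` as the classifier are all assumptions for demonstration.

```python
# Sketch of pairwise multi-class document classification via vector
# concatenation. Assumes precomputed document embeddings (e.g., averaged
# GloVe vectors); here random stand-ins are used for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 50                       # embedding dimension (illustrative)
n_pairs, n_relations = 200, 4  # relation classes, e.g., Wikidata-derived

# Stand-in embeddings for the two documents of each pair.
doc_a = rng.normal(size=(n_pairs, dim))
doc_b = rng.normal(size=(n_pairs, dim))
labels = rng.integers(0, n_relations, size=n_pairs)

# One possible concatenation scheme: [a; b; |a - b|], a common choice
# for sentence-pair tasks.
features = np.hstack([doc_a, doc_b, np.abs(doc_a - doc_b)])
print(features.shape)  # (200, 150)

clf = LogisticRegression(max_iter=1000).fit(features, labels)
preds = clf.predict(features)
```

In the paper's Transformer-based variants, the two documents are instead either fed jointly to one encoder (as in vanilla BERT's pair input) or encoded separately by a shared-weight Siamese encoder before the vectors are combined.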
