Paper title
Classifying Wikipedia in a fine-grained hierarchy: what graphs can contribute
Paper authors
Abstract
Wikipedia is a huge opportunity for machine learning, being the largest semi-structured knowledge base available. Because of this, many works examine its contents and focus on structuring it so that it can be used in learning tasks, for example by classifying it into an ontology. Beyond its textual contents, Wikipedia also displays a typical graph structure, where pages are linked together through citations. In this paper, we address the task of integrating graph (i.e. structure) information to classify Wikipedia into a fine-grained named entity (NE) ontology, the Extended Named Entity hierarchy. To address this task, we first assess the relevance of the graph structure for NE classification. We then explore two directions: one builds feature vectors from graph descriptors commonly used in large-scale network analysis, and the other extends flat classification to a weighted model that takes semantic similarity into account. We conduct practical experiments at scale on a manually labeled subset of 22,000 pages extracted from the Japanese Wikipedia. Our results show that integrating graph information succeeds at reducing the sparsity of the input feature space, and yields classification results that are comparable to or better than those of previous works.
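To make the first direction more concrete, the following is a minimal sketch of how graph descriptors could be turned into per-page feature vectors. The abstract does not say which descriptors the authors use, so the choice of in-degree, out-degree, and PageRank here is an assumption, as are the function name `graph_descriptors`, its parameters, and the toy link graph. It illustrates the general idea only, not the paper's pipeline.

```python
from collections import defaultdict


def graph_descriptors(links, damping=0.85, iterations=50):
    """Return {page: [in_degree, out_degree, pagerank]} for a directed link graph.

    `links` maps each page to the list of pages it links to.
    The three descriptors are illustrative assumptions, not the paper's feature set.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Deduplicate outgoing links and count incoming links.
    out_links = {p: sorted(set(links.get(p, []))) for p in pages}
    in_degree = defaultdict(int)
    for p in pages:
        for q in out_links[p]:
            in_degree[q] += 1

    # PageRank by power iteration; mass held by pages without
    # outgoing links is redistributed uniformly at each step.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        dangling = sum(rank[p] for p in pages if not out_links[p])
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            if out_links[p]:
                share = damping * rank[p] / len(out_links[p])
                for q in out_links[p]:
                    new_rank[q] += share
        rank = new_rank

    return {p: [in_degree[p], len(out_links[p]), rank[p]] for p in pages}


# Hypothetical toy graph: each key links to the pages in its list.
toy_links = {
    "Tokyo": ["Japan"],
    "Kyoto": ["Japan", "Tokyo"],
    "Japan": ["Tokyo"],
    "Mount Fuji": ["Japan"],
}
features = graph_descriptors(toy_links)
```

Each resulting vector is dense and low-dimensional, so it could be concatenated with sparse text features before being passed to a classifier; whether that matches the authors' setup is not stated in the abstract.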