Paper Title

Hierarchical Multi-Label Classification of Scientific Documents

Authors

Mobashir Sadat, Cornelia Caragea

Abstract

Automatic topic classification has been studied extensively to assist in managing and indexing scientific documents in digital collections. As the number of available topics has grown in recent years, it has become necessary to arrange them in a hierarchy, so automatic classification systems need to be able to classify documents hierarchically. In addition, each paper is often assigned to more than one relevant topic; for example, a paper can be assigned to several topics in a hierarchy tree. In this paper, we introduce a new dataset for hierarchical multi-label text classification (HMLTC) of scientific papers called SciHTC, which contains 186,160 papers and 1,233 categories from the ACM CCS tree. We establish strong baselines for HMLTC and propose a multi-task learning approach for topic classification with keyword labeling as an auxiliary task. Our best model achieves a Macro-F1 score of 34.57%, which shows that this dataset provides significant research opportunities in hierarchical scientific topic classification. We make our dataset and code available on GitHub.
