Paper Title

TNT-KID: Transformer-based Neural Tagger for Keyword Identification

Authors

Matej Martinc, Blaž Škrlj, Senja Pollak

Abstract

With growing amounts of available textual data, the development of algorithms capable of automatic analysis, categorization and summarization of these data has become a necessity. In this research we present a novel algorithm for keyword identification, i.e., the extraction of single- or multi-word phrases representing key aspects of a given document, called Transformer-based Neural Tagger for Keyword IDentification (TNT-KID). By adapting the transformer architecture to the specific task at hand and leveraging language model pretraining on a domain-specific corpus, the model overcomes deficiencies of both supervised and unsupervised state-of-the-art approaches to keyword extraction: it offers competitive and robust performance on a variety of datasets while requiring only a fraction of the manually labeled data required by the best-performing systems. This study also offers a thorough error analysis with valuable insights into the inner workings of the model, and an ablation study measuring the influence of specific components of the keyword identification workflow on the overall performance.
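
To make the tagging formulation concrete, the sketch below shows one plausible way per-token labels from a tagger could be turned into single- or multi-word keyword phrases. It is an illustration only, not the authors' implementation: the function name `tags_to_keywords`, the binary 0/1 label scheme, and the example tags are all assumptions introduced here.

```python
from typing import List


def tags_to_keywords(tokens: List[str], tags: List[int]) -> List[str]:
    """Group consecutive tokens tagged 1 into keyword phrases.

    Hypothetical helper: assumes a tagger has already produced one
    binary label per token (1 = part of a keyword, 0 = not).
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    phrases: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == 1:
            # Extend the phrase currently being built.
            current.append(token)
        elif current:
            # A 0 tag closes any open phrase.
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))

    # Drop case-insensitive duplicates while keeping first-seen order.
    seen = set()
    unique: List[str] = []
    for phrase in phrases:
        key = phrase.lower()
        if key not in seen:
            seen.add(key)
            unique.append(phrase)
    return unique


# Made-up labels for illustration; a real tagger would predict these.
tokens = ("We present a neural tagger for keyword identification "
          "based on the transformer").split()
tags = [0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1]
keywords = tags_to_keywords(tokens, tags)
```

Adjacent tagged tokens merge into one phrase, which is how a token-level tagger can return multi-word keywords without a separate candidate-generation step.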
