Two Huge Title and Keyword Generation Corpora of Research Articles

Paper Title

Two Huge Title and Keyword Generation Corpora of Research Articles

Authors

Erion Çano, Ondřej Bojar

Abstract

Recent developments in sequence-to-sequence learning with neural networks have considerably improved the quality of automatically generated text summaries and document keywords, creating a need for even bigger training corpora. Metadata of research articles are usually easy to find online and can be used to perform research on various tasks. In this paper, we introduce two huge datasets for text summarization (OAGSX) and keyword generation (OAGKX) research, containing 34 million and 23 million records, respectively. The data were retrieved from the Open Academic Graph, which is a network of research profiles and publications. We carefully processed each record and also tried several extractive and abstractive methods for both tasks to create performance baselines for other researchers. We further illustrate the performance of those methods by previewing their outputs. In the near future, we would like to apply topic modeling to the two sets to derive subsets of research articles from more specific disciplines.
