Paper Title


X-SCITLDR: Cross-Lingual Extreme Summarization of Scholarly Documents

Authors

Takeshita, Sotaro; Green, Tommaso; Friedrich, Niklas; Eckert, Kai; Ponzetto, Simone Paolo

Abstract


The number of scientific publications nowadays is rapidly increasing, causing information overload for researchers and making it hard for scholars to keep up to date with current trends and lines of work. Consequently, recent work on applying text mining technologies for scholarly publications has investigated the application of automatic text summarization technologies, including extreme summarization, for this domain. However, previous work has concentrated only on monolingual settings, primarily in English. In this paper, we fill this research gap and present an abstractive cross-lingual summarization dataset for four different languages in the scholarly domain, which enables us to train and evaluate models that process English papers and generate summaries in German, Italian, Chinese and Japanese. We present our new X-SCITLDR dataset for multilingual summarization and thoroughly benchmark different models based on a state-of-the-art multilingual pre-trained model, including a two-stage `summarize and translate' approach and a direct cross-lingual model. We additionally explore the benefits of intermediate-stage training using English monolingual summarization and machine translation as intermediate tasks and analyze performance in zero- and few-shot scenarios.
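As a rough illustration of the direct cross-lingual setup the abstract describes (an English paper as input, a summary in another language as output), the sketch below uses a multilingual sequence-to-sequence model via Hugging Face transformers. The checkpoint name, language codes, and generation settings here are assumptions chosen for illustration, not the paper's exact configuration, and the paper's models are additionally trained on X-SCITLDR (optionally with intermediate-stage training) rather than used off the shelf.

```python
# Minimal sketch of a direct cross-lingual summarization setup, assuming an
# mBART-50-style multilingual seq2seq model (the paper only states that a
# state-of-the-art multilingual pre-trained model is used).
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_name = "facebook/mbart-large-50"  # assumption: illustrative checkpoint
tokenizer = MBart50TokenizerFast.from_pretrained(
    model_name, src_lang="en_XX", tgt_lang="de_DE"  # English paper in, German summary out
)
model = MBartForConditionalGeneration.from_pretrained(model_name)

# In practice the input would be a paper's abstract/introduction/conclusion text.
paper_text = "The number of scientific publications is rapidly increasing ..."
inputs = tokenizer(paper_text, return_tensors="pt", truncation=True, max_length=1024)

# Forcing the decoder to start with the German language token makes the
# generation cross-lingual in a single step (no separate translation stage).
summary_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["de_DE"],
    max_length=64,
    num_beams=4,
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])
```

By contrast, the two-stage "summarize and translate" baseline mentioned in the abstract would first generate an English TLDR with a monolingual summarizer and then pass it through a machine translation model for the target language.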
