Liputan6: A Large-scale Indonesian Dataset for Text Summarization

Paper Title

Liputan6: A Large-scale Indonesian Dataset for Text Summarization

Authors

Fajri Koto, Jey Han Lau, Timothy Baldwin

Abstract

In this paper, we introduce a large-scale Indonesian summarization dataset. We harvest articles from Liputan6.com, an online news portal, and obtain 215,827 document-summary pairs. We leverage pre-trained language models to develop benchmark extractive and abstractive summarization methods over the dataset with multilingual and monolingual BERT-based models. We include a thorough error analysis by examining machine-generated summaries that have low ROUGE scores, and expose both issues with ROUGE itself, as well as with extractive and abstractive summarization models.
