Paper title

LANS: Large-scale Arabic News Summarization Corpus

Authors

Abdulaziz Alhamadani, Xuchao Zhang, Jianfeng He, Chang-Tien Lu

Abstract

Text summarization has been intensively studied in many languages, and for some of them it has reached an advanced stage. Arabic Text Summarization (ATS), however, is still developing. Existing ATS datasets are either small or lack diversity. We build LANS, a large-scale and diverse dataset for the Arabic text summarization task. LANS offers 8.4 million articles and their summaries, extracted from newspaper website metadata published between 1999 and 2019. The high-quality and diverse summaries are written by journalists from 22 major Arab newspapers and cover an eclectic mix of at least 7 topics from each source. We conduct an intrinsic evaluation of LANS using both automatic and human evaluation. Human evaluation of 1000 random samples reports 95.4% accuracy for our collected summaries, and automatic evaluation quantifies the diversity and abstractness of the summaries. The dataset is publicly available upon request.
