Paper Title

Descartes: Generating Short Descriptions of Wikipedia Articles

Authors

Sakota, Marija; Peyrard, Maxime; West, Robert

Abstract

Wikipedia is one of the richest knowledge sources on the Web today. In order to facilitate navigating, searching, and maintaining its content, Wikipedia's guidelines state that all articles should be annotated with a so-called short description indicating the article's topic (e.g., the short description of beer is "Alcoholic drink made from fermented cereal grains"). Nonetheless, a large fraction of articles (ranging from 10.2% in Dutch to 99.7% in Kazakh) have no short description yet, with detrimental effects for millions of Wikipedia users. Motivated by this problem, we introduce the novel task of automatically generating short descriptions for Wikipedia articles and propose Descartes, a multilingual model for tackling it. Descartes integrates three sources of information to generate an article description in a target language: the text of the article in all its language versions, the already-existing descriptions (if any) of the article in other languages, and semantic type information obtained from a knowledge graph. We evaluate a Descartes model trained for handling 25 languages simultaneously, showing that it beats baselines (including a strong translation-based baseline) and performs on par with monolingual models tailored for specific languages. A human evaluation on three languages further shows that the quality of Descartes's descriptions is largely indistinguishable from that of human-written descriptions; e.g., 91.3% of our English descriptions (vs. 92.1% of human-written descriptions) pass the bar for inclusion in Wikipedia, suggesting that Descartes is ready for production, with the potential to support human editors in filling a major gap in today's Wikipedia across languages.
