Paper Title

MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases

Paper Authors

Louis Martin, Angela Fan, Éric de la Clergerie, Antoine Bordes, Benoît Sagot

Paper Abstract

Progress in sentence simplification has been hindered by a lack of labeled parallel simplification data, particularly in languages other than English. We introduce MUSS, a Multilingual Unsupervised Sentence Simplification system that does not require labeled simplification data. MUSS uses a novel approach to sentence simplification that trains strong models using sentence-level paraphrase data instead of proper simplification data. These models leverage unsupervised pretraining and controllable generation mechanisms to flexibly adjust attributes such as length and lexical complexity at inference time. We further present a method to mine such paraphrase data in any language from Common Crawl using semantic sentence embeddings, thus removing the need for labeled data. We evaluate our approach on English, French, and Spanish simplification benchmarks and closely match or outperform the previous best supervised results, despite not using any labeled simplification data. We push the state of the art further by incorporating labeled simplification data.
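
The mining step described above, finding paraphrase pairs via semantic sentence embeddings, can be illustrated with a small sketch. This is not the paper's pipeline: MUSS mines sentences from Common Crawl at scale with large-scale nearest-neighbor search, whereas the toy code below assumes the sentence-transformers library, an illustrative multilingual model name, a brute-force cosine-similarity search over an in-memory list, and ad hoc similarity thresholds.

```python
# Minimal sketch of paraphrase mining with semantic sentence embeddings.
# Assumptions (not from the paper): sentence-transformers is installed, the
# model name is illustrative, and similarity thresholds are chosen ad hoc.
import numpy as np
from sentence_transformers import SentenceTransformer


def mine_paraphrase_pairs(sentences, similarity_threshold=0.75, max_pairs=1000):
    """Return sentence pairs whose embeddings are close in cosine similarity."""
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    # Encode and L2-normalize so that a dot product equals cosine similarity.
    embeddings = model.encode(sentences, convert_to_numpy=True)
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    similarities = embeddings @ embeddings.T
    pairs = []
    n = len(sentences)
    for i in range(n):
        for j in range(i + 1, n):
            sim = float(similarities[i, j])
            # Keep near-paraphrases, but drop exact or near-exact duplicates.
            if similarity_threshold <= sim < 0.98:
                pairs.append((sentences[i], sentences[j], sim))
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs[:max_pairs]


if __name__ == "__main__":
    corpus = [
        "The medication must be taken twice a day with food.",
        "You should take the medicine two times daily, together with a meal.",
        "The committee approved the proposal after a lengthy debate.",
    ]
    for src, tgt, sim in mine_paraphrase_pairs(corpus, similarity_threshold=0.6):
        print(f"{sim:.2f}\t{src}\t{tgt}")
```

In the actual system, pairs mined this way serve as pseudo-parallel training data, and simplification behavior is obtained at inference time through controllable generation, e.g., by conditioning the model on target attributes such as length ratio and lexical complexity.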
