Paper Title

Controlled Text Reduction

Paper Authors

Aviv Slobodkin, Paul Roit, Eran Hirsch, Ori Ernst, Ido Dagan

Paper Abstract

Producing a reduced version of a source text, as in generic or focused summarization, inherently involves two distinct subtasks: deciding on targeted content and generating a coherent text conveying it. While some popular approaches address summarization as a single end-to-end task, prominent works support decomposed modeling for individual subtasks. Further, semi-automated text reduction is also very appealing, where users may identify targeted content while models would generate a corresponding coherent summary. In this paper, we focus on the second subtask of generating coherent text given pre-selected content. Concretely, we formalize \textit{Controlled Text Reduction} as a standalone task, whose input is a source text with marked spans of targeted content ("highlighting"). A model then needs to generate a coherent text that includes all and only the target information. We advocate the potential of such models, both for modular fully-automatic summarization and for semi-automated human-in-the-loop use cases. To facilitate proper research, we crowdsource high-quality dev and test datasets for the task. Further, we automatically generate a larger "silver" training dataset from available summarization benchmarks, leveraging a pretrained summary-source alignment model. Finally, employing these datasets, we present a supervised baseline model, showing promising results and insightful analyses.
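
The task format described in the abstract lends itself to a simple encoding. Below is a minimal sketch, assuming highlights are marked with inline <h>...</h> tags and passed to a generic pretrained seq2seq model; the tag scheme, the t5-base checkpoint, and the helper functions are illustrative assumptions, not the paper's exact setup, and a usable model would first be fine-tuned on highlight/summary pairs such as the crowdsourced and silver datasets the paper describes.

```python
# Illustrative sketch of the Controlled Text Reduction input/output format:
# the source text arrives with pre-selected "highlight" spans, and a seq2seq
# model is asked to generate a coherent text covering exactly that content.
# The <h>...</h> tags, the t5-base checkpoint, and the helpers below are
# assumptions for illustration, not the paper's exact implementation.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM


def span_of(text: str, phrase: str) -> tuple[int, int]:
    """Return the (start, end) character offsets of `phrase` in `text`."""
    start = text.index(phrase)
    return start, start + len(phrase)


def mark_highlights(source: str, spans: list[tuple[int, int]]) -> str:
    """Wrap each (start, end) span of `source` in <h>...</h> tags."""
    pieces, prev = [], 0
    for start, end in sorted(spans):  # assumes non-overlapping spans
        pieces.append(source[prev:start])
        pieces.append(f"<h>{source[start:end]}</h>")
        prev = end
    pieces.append(source[prev:])
    return "".join(pieces)


source = (
    "The storm hit the coast on Monday, causing widespread flooding. "
    "Officials ordered evacuations in three counties. "
    "Power was restored by Friday."
)
# The user (or an upstream content-selection module) picks the target spans.
highlights = [
    span_of(source, "The storm hit the coast on Monday"),
    span_of(source, "Officials ordered evacuations in three counties."),
]
highlighted_input = mark_highlights(source, highlights)

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

inputs = tokenizer(highlighted_input, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The same format also makes the silver-data construction natural: aligning each reference-summary span to its supporting source span (via the pretrained summary-source alignment model mentioned above) yields highlight-marked source texts paired with existing reference summaries as training targets.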
