Paper Title

Controlled Text Reduction

Paper Authors

Aviv Slobodkin, Paul Roit, Eran Hirsch, Ori Ernst, Ido Dagan

Paper Abstract

Producing a reduced version of a source text, as in generic or focused summarization, inherently involves two distinct subtasks: deciding on targeted content and generating a coherent text conveying it. While some popular approaches address summarization as a single end-to-end task, prominent works support decomposed modeling for individual subtasks. Further, semi-automated text reduction is also very appealing, where users may identify targeted content while models would generate a corresponding coherent summary. In this paper, we focus on the second subtask of generating coherent text given pre-selected content. Concretely, we formalize \textit{Controlled Text Reduction} as a standalone task, whose input is a source text with marked spans of targeted content ("highlighting"). A model then needs to generate a coherent text that includes all and only the target information. We advocate the potential of such models, both for modular fully-automatic summarization and for semi-automated human-in-the-loop use cases. To facilitate proper research, we crowdsource high-quality dev and test datasets for the task. Further, we automatically generate a larger "silver" training dataset from available summarization benchmarks, leveraging a pretrained summary-source alignment model. Finally, employing these datasets, we present a supervised baseline model, showing promising results and insightful analyses.
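
The task format described in the abstract lends itself to a simple encoding. Below is a minimal sketch, assuming highlights are marked with inline <h>...</h> tags and passed to a generic pretrained seq2seq model; the tag scheme, the t5-base checkpoint, and the helper functions are illustrative assumptions, not the paper's exact setup, and a usable model would first be fine-tuned on highlight/summary pairs such as the crowdsourced and silver datasets the paper describes.

```python
# Illustrative sketch of the Controlled Text Reduction input/output format:
# the source text arrives with pre-selected "highlight" spans, and a seq2seq
# model is asked to generate a coherent text covering exactly that content.
# The <h>...</h> tags, the t5-base checkpoint, and the helpers below are
# assumptions for illustration, not the paper's exact implementation.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM


def span_of(text: str, phrase: str) -> tuple[int, int]:
    """Return the (start, end) character offsets of `phrase` in `text`."""
    start = text.index(phrase)
    return start, start + len(phrase)


def mark_highlights(source: str, spans: list[tuple[int, int]]) -> str:
    """Wrap each (start, end) span of `source` in <h>...</h> tags."""
    pieces, prev = [], 0
    for start, end in sorted(spans):  # assumes non-overlapping spans
        pieces.append(source[prev:start])
        pieces.append(f"<h>{source[start:end]}</h>")
        prev = end
    pieces.append(source[prev:])
    return "".join(pieces)


source = (
    "The storm hit the coast on Monday, causing widespread flooding. "
    "Officials ordered evacuations in three counties. "
    "Power was restored by Friday."
)
# The user (or an upstream content-selection module) picks the target spans.
highlights = [
    span_of(source, "The storm hit the coast on Monday"),
    span_of(source, "Officials ordered evacuations in three counties."),
]
highlighted_input = mark_highlights(source, highlights)

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

inputs = tokenizer(highlighted_input, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The same format also makes the silver-data construction natural: aligning each reference-summary span to its supporting source span (via the pretrained summary-source alignment model mentioned above) yields highlight-marked source texts paired with existing reference summaries as training targets.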
