Paper Title
ED-FAITH: Evaluating Dialogue Summarization on Faithfulness
Paper Authors
Paper Abstract
Abstractive summarization models typically generate content that is unfaithful to the input, highlighting the importance of evaluating the faithfulness of generated summaries. Most faithfulness metrics have only been evaluated in the news domain; can they be transferred to other summarization tasks? In this work, we first present a systematic study of faithfulness metrics for dialogue summarization. We evaluate common faithfulness metrics on dialogue datasets and observe that most of them correlate poorly with human judgements despite performing well on news datasets. Given these findings, to improve the performance of existing metrics on dialogue summarization, we first fine-tune them on an in-domain dataset and then apply unlikelihood training on negative samples, and show that both steps successfully improve metric performance on dialogue data. Inspired by the strong zero-shot performance of the T0 language model, we further propose T0-Score, a new metric for faithfulness evaluation that shows consistent improvement over baseline metrics across multiple domains.
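The abstract does not spell out the unlikelihood objective it applies to negative samples. Below is a minimal PyTorch sketch of standard token-level unlikelihood training, which penalizes probability mass assigned to the tokens of a negative (unfaithful) summary via -log(1 - p(y_t)); the masking scheme and the `alpha` mixing weight mentioned in the comment are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def unlikelihood_loss(logits: torch.Tensor,
                      negative_targets: torch.Tensor,
                      pad_id: int = -100) -> torch.Tensor:
    """Token-level unlikelihood loss, -log(1 - p(y_t)), pushing
    probability mass away from tokens of a negative summary."""
    log_probs = F.log_softmax(logits, dim=-1)            # (batch, seq, vocab)
    mask = negative_targets.ne(pad_id)                   # ignore padding
    safe_targets = negative_targets.clamp(min=0)         # avoid -100 as index
    token_log_probs = log_probs.gather(
        -1, safe_targets.unsqueeze(-1)).squeeze(-1)      # log p(y_t | context)
    # Clamp 1 - p for numerical stability before taking the log.
    one_minus_p = (1.0 - token_log_probs.exp()).clamp(min=1e-6)
    loss = -one_minus_p.log()
    return (loss * mask).sum() / mask.sum()

# In training, this would typically be mixed with the usual MLE loss on
# faithful references, e.g. total = mle_loss + alpha * unlikelihood_loss(...),
# where alpha is a hypothetical weighting hyperparameter.
```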
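The abstract likewise does not define how T0-Score is computed. If, in the spirit of likelihood-based metrics such as BARTScore, it scores a candidate summary by its average token log-likelihood under the model conditioned on the source dialogue, a minimal sketch with the HuggingFace transformers API could look like the following; the checkpoint choice and the scoring direction (dialogue → summary) are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Checkpoint choice is an assumption; the paper builds on the T0 family.
MODEL_NAME = "bigscience/T0_3B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def t0_score(dialogue: str, summary: str) -> float:
    """Score a summary by its average token log-likelihood under T0,
    conditioned on the dialogue (higher = judged more faithful)."""
    inputs = tokenizer(dialogue, return_tensors="pt", truncation=True)
    labels = tokenizer(summary, return_tensors="pt",
                       truncation=True).input_ids
    out = model(**inputs, labels=labels)
    # out.loss is the mean token-level cross-entropy over the summary,
    # so its negation is the length-normalized log-likelihood.
    return -out.loss.item()
```

A higher score would indicate that the summary is more probable given the dialogue, which such metrics use as a proxy for faithfulness.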