Paper Title

Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries

Paper Authors

Daniel Deutsch, Dan Roth

Paper Abstract

Reference-based metrics such as ROUGE or BERTScore evaluate the content quality of a summary by comparing the summary to a reference. Ideally, this comparison should measure the summary's information quality by calculating how much information the summaries have in common. In this work, we analyze the token alignments used by ROUGE and BERTScore to compare summaries and argue that their scores largely cannot be interpreted as measuring information overlap, but rather the extent to which they discuss the same topics. Further, we provide evidence that this result holds true for many other summarization evaluation metrics. The consequence of this result is that the summarization community has not yet found a reliable automatic metric that aligns with its research goal, to generate summaries with high-quality information. Then, we propose a simple and interpretable method of evaluating summaries which does directly measure information overlap and demonstrate how it can be used to gain insights into model behavior that could not be provided by other methods alone.
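As a rough illustrative sketch (not code from the paper), the kind of lexical token alignment underlying ROUGE-1 can be written as clipped unigram matching between a summary and a reference. The example sentences below are hypothetical; they show how two summaries about the same topic can score highly even when their information contradicts, which is the phenomenon the abstract describes:

```python
from collections import Counter

def rouge1_f1(summary_tokens, reference_tokens):
    """ROUGE-1 F1: clipped unigram overlap between summary and reference.

    Alignment is purely lexical: each reference token matches at most
    one identical summary token (Counter intersection takes min counts).
    """
    if not summary_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(summary_tokens) & Counter(reference_tokens)).values())
    precision = overlap / len(summary_tokens)
    recall = overlap / len(reference_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: same topic, contradictory information.
summary = "the company reported strong quarterly profits".split()
reference = "the company reported weak quarterly losses".split()
print(rouge1_f1(summary, reference))  # ~0.67: high score from shared topic words
```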
