Title
How Far are We from Robust Long Abstractive Summarization?
Authors
Abstract
Abstractive summarization has made tremendous progress in recent years. In this work, we perform fine-grained human annotations to evaluate long document abstractive summarization systems (i.e., models and metrics) with the aim of deploying them to generate reliable summaries. For long document abstractive models, we show that constantly striving for state-of-the-art ROUGE results can lead us to generate more relevant summaries but not factual ones. For long document evaluation metrics, human evaluation results show that ROUGE remains the best at evaluating the relevancy of a summary. Our analysis also reveals important limitations of factuality metrics in detecting different types of factual errors, as well as the reasons behind the effectiveness of BARTScore. We then suggest promising directions in the endeavor of developing factual consistency metrics. Finally, we release our annotated long document dataset with the hope that it can contribute to the development of metrics across a broader range of summarization settings.