Paper Title
How to Find Strong Summary Coherence Measures? A Toolbox and a Comparative Study for Summary Coherence Measure Evaluation
Paper Authors
Paper Abstract
Automatically evaluating the coherence of summaries is of great significance both to enable cost-efficient summarizer evaluation and as a tool for improving coherence by selecting high-scoring candidate summaries. While many different approaches have been suggested to model summary coherence, they are often evaluated using disparate datasets and metrics. This makes it difficult to understand their relative performance and identify ways forward towards better summary coherence modelling. In this work, we conduct a large-scale investigation of various methods for summary coherence modelling on an even playing field. Additionally, we introduce two novel analysis measures, intra-system correlation and bias matrices, that help identify biases in coherence measures and provide robustness against system-level confounders. While none of the currently available automatic coherence measures are able to assign reliable coherence scores to system summaries across all evaluation metrics, large-scale language models fine-tuned on self-supervised tasks show promising results, as long as fine-tuning takes into account that they need to generalize across different summary lengths.
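The abstract does not spell out how intra-system correlation is computed. As a rough, hypothetical sketch (the definition below is an assumption, not taken from the paper), one plausible reading is: correlate a coherence measure's scores with human coherence ratings separately within each summarization system, then average the per-system correlations, so that system-level confounders cannot inflate the result.

```python
# Hypothetical sketch of an "intra-system correlation" computation.
# Assumption (not from the paper): correlate measure scores with human
# ratings within each system, then average across systems.
from statistics import mean


def pearson(xs, ys):
    # Plain Pearson correlation for small lists of scores.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def intra_system_correlation(scores_by_system):
    # scores_by_system: {system_name: [(measure_score, human_rating), ...]}
    # Each system contributes one correlation; the mean is the final score.
    per_system = [
        pearson([m for m, _ in pairs], [h for _, h in pairs])
        for pairs in scores_by_system.values()
    ]
    return mean(per_system)
```

Under this assumed definition, a measure that merely separates good systems from bad ones, without ranking summaries correctly inside each system, would score poorly.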