Paper Title
FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation
Paper Authors
Paper Abstract
Recent model-based, reference-free metrics for open-domain dialogue evaluation exhibit promising correlations with human judgment. However, they either perform turn-level evaluation or assess only a single dialogue quality dimension. A good evaluation metric is expected to assess multiple quality dimensions at the dialogue level. To this end, we propose a multi-dimensional dialogue-level metric that consists of three sub-metrics, each targeting a specific dimension. The sub-metrics are trained with novel self-supervised objectives and exhibit strong correlations with human judgment for their respective dimensions. Moreover, we explore two approaches to combining the sub-metrics: metric ensemble and multi-task learning. Both approaches yield a holistic metric that significantly outperforms the individual sub-metrics. Compared to the existing state-of-the-art metric, the combined metrics achieve around a 16% relative improvement on average across three high-quality dialogue-level evaluation benchmarks.
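To make the metric-ensemble combination concrete, below is a minimal Python sketch under stated assumptions: each dimension-specific sub-metric maps a dialogue to a normalized score, and the holistic metric averages them. All names here (ensemble_score, SubMetric) and the example dimension labels in the comments are illustrative placeholders, not the paper's actual API; in the paper, each sub-metric is a trained model, and multi-task learning is the alternative combination strategy in which one shared model handles all dimensions.

from typing import Callable, List

# A dialogue is a list of utterance strings; a sub-metric maps it to a
# score in [0, 1] along one quality dimension.
SubMetric = Callable[[List[str]], float]

def ensemble_score(dialogue: List[str], sub_metrics: List[SubMetric]) -> float:
    """Metric ensemble: average the dimension-specific sub-metric scores."""
    scores = [metric(dialogue) for metric in sub_metrics]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    dialogue = [
        "Hi, how was your weekend?",
        "Great, I went hiking. Do you enjoy the outdoors?",
        "I do! I try to camp every summer.",
    ]
    # Constant-valued placeholders standing in for trained sub-metrics, each
    # learned with a self-supervised objective for its dimension.
    sub_metrics: List[SubMetric] = [
        lambda d: 0.82,  # e.g., a coherence-style sub-metric (hypothetical)
        lambda d: 0.74,  # e.g., a likability-style sub-metric (hypothetical)
        lambda d: 0.61,  # e.g., a topic-depth-style sub-metric (hypothetical)
    ]
    print(f"holistic score: {ensemble_score(dialogue, sub_metrics):.2f}")

One design note on this sketch: a simple unweighted average assumes the sub-metric scores are on comparable scales; if they are not, per-dimension normalization or learned weights would be needed before ensembling.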