Paper Title
Evaluating Dialogue Generation Systems via Response Selection
Paper Authors
Paper Abstract
Existing automatic evaluation metrics for open-domain dialogue response generation systems correlate poorly with human evaluation. We focus on evaluating response generation systems via response selection. To evaluate systems properly via response selection, we propose a method to construct response selection test sets with well-chosen false candidates. Specifically, we propose to construct test sets by filtering out two types of false candidates: (i) those unrelated to the ground-truth response and (ii) those acceptable as appropriate responses. Through experiments, we demonstrate that evaluating systems via response selection with the test sets developed by our method correlates more strongly with human evaluation than widely used automatic evaluation metrics such as BLEU.
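To make the evaluation protocol concrete, the sketch below shows how a system can be scored via response selection: for each test instance the system ranks the ground-truth response against a set of false candidates, and the reported metric is the fraction of instances where the ground-truth response is ranked first. This is a minimal illustration, not the authors' released code; the scoring function `score_fn`, the instance dictionary layout, and the toy word-overlap scorer are assumptions made for the example.

```python
import random


def select_response(score_fn, context, candidates):
    """Return the candidate that the system scores highest for the given context."""
    return max(candidates, key=lambda cand: score_fn(context, cand))


def response_selection_accuracy(score_fn, test_set):
    """Fraction of instances where the ground-truth response is ranked first.

    Each instance pairs a dialogue context with one ground-truth response and
    a list of well-chosen false candidates (hypothetical data layout).
    """
    correct = 0
    for instance in test_set:
        candidates = [instance["ground_truth"]] + instance["false_candidates"]
        random.shuffle(candidates)  # avoid any positional bias in ranking
        chosen = select_response(score_fn, instance["context"], candidates)
        if chosen == instance["ground_truth"]:
            correct += 1
    return correct / len(test_set)


if __name__ == "__main__":
    # Toy scorer: count of words shared between context and candidate.
    def overlap_score(context, candidate):
        return len(set(context.lower().split()) & set(candidate.lower().split()))

    toy_test_set = [
        {
            "context": "Do you want to grab lunch together?",
            "ground_truth": "Sure, let's grab lunch at noon.",
            "false_candidates": ["The weather was terrible yesterday."],
        },
    ]
    print(response_selection_accuracy(overlap_score, toy_test_set))
```

In practice the score_fn would be the generation system's own likelihood or ranking score, and the accuracy obtained on a test set built this way is what the abstract compares against human evaluation and against metrics such as BLEU.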