Paper Title

Achieving Reliable Human Assessment of Open-Domain Dialogue Systems

Authors

Tianbo Ji, Yvette Graham, Gareth J. F. Jones, Chenyang Lyu, Qun Liu

Abstract

Evaluation of open-domain dialogue systems is highly challenging and development of better techniques is highlighted time and again as desperately needed. Despite substantial efforts to carry out reliable live evaluation of systems in recent competitions, annotations have been abandoned and reported as too unreliable to yield sensible results. This is a serious problem since automatic metrics are not known to provide a good indication of what may or may not be a high-quality conversation. Answering the distress call of competitions that have emphasized the urgent need for better evaluation techniques in dialogue, we present the successful development of human evaluation that is highly reliable while still remaining feasible and low cost. Self-replication experiments reveal almost perfectly repeatable results with a correlation of $r=0.969$. Furthermore, due to the lack of appropriate methods of statistical significance testing, the likelihood of potential improvements to systems occurring due to chance is rarely taken into account in dialogue evaluation, and the evaluation we propose facilitates application of standard tests. Since we have developed a highly reliable evaluation method, new insights into system performance can be revealed. We therefore include a comparison of state-of-the-art models (i) with and without personas, to measure the contribution of personas to conversation quality, as well as (ii) prescribed versus freely chosen topics. Interestingly with respect to personas, results indicate that personas do not positively contribute to conversation quality as expected.
