Paper Title

Dialogue Evaluation with Offline Reinforcement Learning

Authors

Nurul Lubis, Christian Geishauser, Hsien-Chin Lin, Carel van Niekerk, Michael Heck, Shutong Feng, Milica Gašić

Abstract

Task-oriented dialogue systems aim to fulfill user goals through natural language interactions. Ideally they are evaluated with human users, but this is infeasible at every iteration of the development phase. Simulated users could be an alternative, but their development is nontrivial. Researchers therefore resort to offline metrics on existing human-human corpora, which are more practical and easily reproducible. Unfortunately, these metrics are limited in how well they reflect the real performance of dialogue systems. BLEU, for instance, correlates poorly with human judgement, and existing corpus-based metrics such as success rate overlook dialogue context mismatches. There is still a need for a reliable metric for task-oriented systems that generalizes well and correlates strongly with human judgements. In this paper, we propose the use of offline reinforcement learning (RL) for dialogue evaluation based on a static corpus. Such an evaluator is typically called a critic and is used for policy optimization. We go one step further and show that offline RL critics can be trained on a static corpus for any dialogue system as external evaluators, allowing dialogue performance comparisons across various types of systems. This approach has the benefit of being corpus- and model-independent while attaining strong correlation with human judgements, which we confirm via an interactive user trial.
