Paper Title

Is Your Goal-Oriented Dialog Model Performing Really Well? Empirical Analysis of System-wise Evaluation

Paper Authors

Ryuichi Takanobu, Qi Zhu, Jinchao Li, Baolin Peng, Jianfeng Gao, Minlie Huang

Paper Abstract

There is a growing interest in developing goal-oriented dialog systems which serve users in accomplishing complex tasks through multi-turn conversations. Although many methods are devised to evaluate and improve the performance of individual dialog components, there is a lack of comprehensive empirical study on how different components contribute to the overall performance of a dialog system. In this paper, we perform a system-wise evaluation and present an empirical analysis on different types of dialog systems which are composed of different modules in different settings. Our results show that (1) a pipeline dialog system trained using fine-grained supervision signals at different component levels often obtains better performance than the systems that use joint or end-to-end models trained on coarse-grained labels, (2) component-wise, single-turn evaluation results are not always consistent with the overall performance of a dialog system, and (3) despite the discrepancy between simulators and human users, simulated evaluation is still a valid alternative to the costly human evaluation especially in the early stage of development.
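
To make the notion of system-wise, simulated evaluation mentioned in the abstract concrete, the sketch below lets a user simulator converse with a complete dialog system (whether pipeline, joint, or end-to-end) and reports task success rate over many simulated dialogs. This is a minimal illustration only; the interfaces and conventions here (DialogSystem, UserSimulator, system_wise_evaluation, the empty-string opening turn) are assumptions for the sketch, not the paper's actual evaluation code.

```python
from typing import Protocol, Tuple


class DialogSystem(Protocol):
    """Hypothetical interface for the system under test (pipeline, joint, or end-to-end)."""
    def init_session(self) -> None: ...
    def respond(self, user_utterance: str) -> str: ...


class UserSimulator(Protocol):
    """Hypothetical interface for a user simulator with a sampled task goal."""
    def init_session(self) -> None: ...
    def respond(self, system_utterance: str) -> Tuple[str, bool]: ...  # (utterance, dialog_finished)
    def task_success(self) -> bool: ...


def system_wise_evaluation(system: DialogSystem,
                           simulator: UserSimulator,
                           n_dialogs: int = 1000,
                           max_turns: int = 20) -> float:
    """Run full simulated dialogs and return the fraction that end in task success."""
    successes = 0
    for _ in range(n_dialogs):
        system.init_session()
        simulator.init_session()
        # Convention assumed here: the simulator opens the dialog when given an empty turn.
        user_utt, done = simulator.respond("")
        turns = 0
        while not done and turns < max_turns:
            # One complete system turn (e.g., NLU -> DST -> Policy -> NLG in a pipeline).
            sys_utt = system.respond(user_utt)
            user_utt, done = simulator.respond(sys_utt)
            turns += 1
        successes += int(simulator.task_success())
    return successes / n_dialogs
```

The same loop also shows why component-wise and system-wise results can diverge: a module's error is never scored in isolation but propagates through the remaining modules before the simulator's next turn.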
