Paper Title
Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems
Paper Authors
Paper Abstract
The lack of time-efficient and reliable evaluation methods hampers the development of conversational dialogue systems (chatbots). Evaluations that require humans to converse with chatbots are time- and cost-intensive, put high cognitive demands on the human judges, and yield low-quality results. In this work, we introduce \emph{Spot The Bot}, a cost-efficient and robust evaluation framework that replaces human-bot conversations with conversations between bots. Human judges then only annotate, for each entity in a conversation, whether they think it is human or not (assuming there are human participants in these conversations). These annotations then allow us to rank chatbots by their ability to mimic the conversational behavior of humans. Since we expect that all bots are eventually recognized as such, we incorporate a metric that measures which chatbot can uphold human-like behavior the longest, i.e., \emph{Survival Analysis}. This metric can correlate a bot's performance with certain of its characteristics (e.g., fluency or sensibleness), yielding interpretable results. The comparatively low cost of our framework allows for frequent evaluations of chatbots during their evaluation cycle. We empirically validate our claims by applying \emph{Spot The Bot} to three domains, evaluating several state-of-the-art chatbots, and drawing comparisons to related work. The framework is released as a ready-to-use tool.
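
The survival-analysis idea in the abstract can be illustrated with a minimal sketch. It assumes a Kaplan-Meier estimator, which is one standard survival-analysis technique; the abstract does not say which estimator the framework uses. The function name `kaplan_meier`, the variable names, and the toy data are hypothetical and are not taken from the paper or its released tool.

```python
from collections import Counter


def kaplan_meier(durations, spotted):
    """Estimate the probability that a bot is still taken for human.

    durations: number of exchanges after which each bot was either
               spotted or the conversation ended (assumed input format).
    spotted:   True if the bot was identified as a bot at that point,
               False if the conversation ended first (censored).
    Returns a list of (t, S(t)) pairs, one per time with a spotting event.
    """
    events = Counter(t for t, s in zip(durations, spotted) if s)
    exits = Counter(durations)   # spotted or censored at each time
    at_risk = len(durations)     # bots not yet spotted or censored
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Kaplan-Meier step: multiply by the fraction that survives t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical toy data: six conversations of one bot.
durations = [2, 3, 3, 5, 5, 5]
spotted = [True, True, False, True, False, False]
curve = kaplan_meier(durations, spotted)
for t, s in curve:
    print(f"after {t} exchanges: S = {s:.3f}")
```

Under these assumptions, a bot whose curve stays higher for longer would rank above one whose curve drops early, which matches the abstract's notion of upholding human-like behavior the longest.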