Benchmarking Commercial Intent Detection Services with Practice-Driven Evaluations

Paper Title

Benchmarking Commercial Intent Detection Services with Practice-Driven Evaluations

Authors

Haode Qi, Lin Pan, Atin Sood, Abhishek Shah, Ladislav Kunc, Mo Yu, Saloni Potdar

Abstract

Intent detection is a key component of modern goal-oriented dialog systems that accomplish a user task by predicting the intent of users' text input. There are three primary challenges in designing robust and accurate intent detection models. First, typical intent detection models require a large amount of labeled data to achieve high accuracy. Unfortunately, in practical scenarios it is more common to find small, unbalanced, and noisy datasets. Secondly, even with large training data, the intent detection models can see a different distribution of test data when being deployed in the real world, leading to poor accuracy. Finally, a practical intent detection model must be computationally efficient in both training and single query inference so that it can be used continuously and re-trained frequently. We benchmark intent detection methods on a variety of datasets. Our results show that Watson Assistant's intent detection model outperforms other commercial solutions and is comparable to large pretrained language models while requiring only a fraction of computational resources and training data. Watson Assistant demonstrates a higher degree of robustness when the training and test distributions differ.

Paper Title