A Test for Evaluating Performance in Human-Computer Systems

Paper Title

A Test for Evaluating Performance in Human-Computer Systems

Authors

Andres Campero, Michelle Vaccaro, Jaeyoon Song, Haoran Wen, Abdullah Almaatouq, Thomas W. Malone

Abstract

The Turing test for comparing computer performance to that of humans is well known, but, surprisingly, there is no widely used test for comparing how much better human-computer systems perform relative to humans alone, computers alone, or other baselines. Here, we show how to perform such a test using the ratio of means as a measure of effect size. Then we demonstrate the use of this test in three ways. First, in an analysis of 79 recently published experimental results, we find that, surprisingly, over half of the studies find a decrease in performance, the mean and median ratios of performance improvement are both approximately 1 (corresponding to no improvement at all), and the maximum ratio is 1.36 (a 36% improvement). Second, we experimentally investigate whether a higher performance improvement ratio is obtained when 100 human programmers generate software using GPT-3, a massive, state-of-the-art AI system. In this case, we find a speed improvement ratio of 1.27 (a 27% improvement). Finally, we find that 50 human non-programmers using GPT-3 can perform the task about as well as, and less expensively than, the human programmers. In this case, neither the non-programmers nor the computer would have been able to perform the task alone, so this is an example of a very strong form of human-computer synergy.
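
To make the ratio-of-means measure concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' procedure: the function names and the sample scores are hypothetical, scores are assumed to be "higher is better", and the confidence interval uses the standard delta-method approximation on the log ratio, which the abstract does not specify.

```python
import math
from statistics import mean, stdev


def ratio_of_means(system, baseline):
    """Return mean(system) / mean(baseline).

    Assumes higher scores are better. A value of 1.0 means no
    improvement; 1.36 would correspond to a 36% improvement.
    """
    return mean(system) / mean(baseline)


def ratio_confidence_interval(system, baseline, z=1.96):
    """Approximate CI for the ratio of means (delta method on the log ratio).

    This is a common textbook approximation, assumed here for
    illustration; it requires positive means and independent samples.
    """
    m_s, m_b = mean(system), mean(baseline)
    se = math.sqrt(
        stdev(system) ** 2 / (len(system) * m_s ** 2)
        + stdev(baseline) ** 2 / (len(baseline) * m_b ** 2)
    )
    log_ratio = math.log(m_s / m_b)
    return math.exp(log_ratio - z * se), math.exp(log_ratio + z * se)


# Hypothetical scores, invented for this example (not data from the paper).
human_only = [10.0, 12.0, 11.0, 9.0, 8.0]
human_plus_computer = [13.0, 12.0, 14.0, 11.0, 10.0]

ratio = ratio_of_means(human_plus_computer, human_only)
low, high = ratio_confidence_interval(human_plus_computer, human_only)
print(f"ratio of means: {ratio:.2f}")
print(f"approximate 95% CI: ({low:.2f}, {high:.2f})")
```

For a "lower is better" measure such as completion time, one would invert the ratio (baseline mean over system mean) so that values above 1 still indicate improvement.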
