Can Machines Imitate Humans? Integrative Turing Tests for Vision and Language Demonstrate a Narrowing Gap

Paper title

Can Machines Imitate Humans? Integrative Turing Tests for Vision and Language Demonstrate a Narrowing Gap

Authors

Zhang, Mengmi; Dellaferrera, Giorgia; Sikarwar, Ankur; Chen, Caishun; Armendariz, Marcelo; Mudrik, Noga; Agrawal, Prachi; Madan, Spandan; Shetty, Mranmay; Barbu, Andrei; Yang, Haochen; Kumar, Tanishq; Han, Shui'Er; Singh, Aman Raj; Sadwani, Meghna; Dellaferrera, Stella; Pizzochero, Michele; Tang, Brandon; Ong, Yew Soon; Pfister, Hanspeter; Kreiman, Gabriel

Abstract

As AI algorithms increasingly participate in daily activities, it becomes critical to ascertain whether the agents we interact with are human or not. To address this question, we turn to the Turing test and systematically benchmark current AIs in their abilities to imitate humans in three language tasks (Image captioning, Word association, and Conversation) and three vision tasks (Object detection, Color estimation, and Attention prediction). The experiments involved 549 human agents plus 26 AI agents for dataset creation, and 1,126 human judges plus 10 AI judges, in 25,650 Turing-like tests. The results reveal that current AIs are not far from being able to impersonate humans in complex language and vision challenges. While human judges were often deceived, simple AI judges outperformed human judges in distinguishing human answers from AI answers. The results of imitation tests are only minimally correlated with standard performance metrics in AI. Thus, evaluating whether a machine can pass as a human constitutes an important independent test to evaluate AI algorithms. The curated, large-scale, Turing datasets introduced here and their evaluation metrics provide new benchmarks and insights to assess whether an agent is human or not and emphasize the relevance of rigorous, systematic, and quantitative imitation tests in these and other AI domains.
