Paper Title

From Human Days to Machine Seconds: Automatically Answering and Generating Machine Learning Final Exams

Authors

Iddo Drori, Sarah J. Zhang, Reece Shuttleworth, Sarah Zhang, Keith Tyser, Zad Chin, Pedro Lantigua, Saisamrit Surbehera, Gregory Hunter, Derek Austin, Leonard Tang, Yann Hicke, Sage Simhon, Sathwik Karnik, Darnell Granberry, Madeleine Udell

Abstract

A final exam in machine learning at a top institution such as MIT, Harvard, or Cornell typically takes faculty days to write, and students hours to solve. We demonstrate that large language models pass machine learning finals at a human level, on finals available online after the models were trained, and automatically generate new human-quality final exam questions in seconds. Previous work has developed program synthesis and few-shot learning methods to solve university-level problem set questions in mathematics and STEM courses. In this work, we develop and compare methods that solve final exams, which differ from problem sets in several ways: the questions are longer, have multiple parts, are more complicated, and span a broader set of topics. We curate a dataset and benchmark of questions from machine learning final exams available online and code for answering these questions and generating new questions. We show how to generate new questions from other questions and course notes. For reproducibility and future research on this final exam benchmark, we use automatic checkers for multiple-choice, numeric, and questions with expression answers. We perform ablation studies comparing zero-shot learning with few-shot learning and chain-of-thought prompting using GPT-3, OPT, Codex, and ChatGPT across machine learning topics and find that few-shot learning methods perform best. We highlight the transformative potential of language models to streamline the writing and solution of large-scale assessments, significantly reducing the workload from human days to mere machine seconds. Our results suggest that rather than banning large language models such as ChatGPT in class, instructors should teach students to harness them by asking students meta-questions about correctness, completeness, and originality of the responses generated, encouraging critical thinking in academic studies.
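The ablation the abstract describes compares zero-shot prompting (the question alone), few-shot prompting (worked question-answer pairs prepended), and chain-of-thought prompting. A minimal sketch of how such prompts might be assembled is shown below; this is not the authors' code, and the question text and example pairs are hypothetical placeholders.

```python
# Hypothetical sketch of prompt assembly for the three settings compared in
# the paper's ablation: zero-shot, few-shot, and chain-of-thought prompting.

def build_prompt(question, examples=None, chain_of_thought=False):
    """Return a prompt string: zero-shot if `examples` is empty,
    few-shot otherwise; optionally append a chain-of-thought cue."""
    parts = []
    # Few-shot: prepend worked question-answer pairs before the target question.
    for ex_question, ex_answer in (examples or []):
        parts.append(f"Q: {ex_question}\nA: {ex_answer}\n")
    # The target exam question, left open for the model to complete.
    parts.append(f"Q: {question}\nA:")
    if chain_of_thought:
        # A common cue that elicits step-by-step reasoning.
        parts.append(" Let's think step by step.")
    return "".join(parts)

# Zero-shot: the target question alone.
zero_shot = build_prompt(
    "What is the VC dimension of a linear classifier in 2D?"
)

# Few-shot with chain-of-thought: one worked example plus a reasoning cue.
few_shot = build_prompt(
    "What is the VC dimension of a linear classifier in 2D?",
    examples=[("What is the VC dimension of an interval on the real line?", "2")],
    chain_of_thought=True,
)
```

The resulting strings would then be sent to a model such as GPT-3, Codex, or ChatGPT; the paper finds the few-shot variant performs best on its final-exam benchmark.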
