Paper Title
LMentry: A Language Model Benchmark of Elementary Language Tasks
Paper Authors
Paper Abstract
As the performance of large language models rapidly improves, benchmarks are getting larger and more complex as well. We present LMentry, a benchmark that avoids this "arms race" by focusing on a compact set of tasks that are trivial to humans, e.g., writing a sentence containing a specific word, identifying which words in a list belong to a specific category, or choosing which of two words is longer. LMentry is specifically designed to provide quick and interpretable insights into the capabilities and robustness of large language models. Our experiments reveal a wide variety of failure cases that, while immediately obvious to humans, pose a considerable challenge for large language models, including OpenAI's latest 175B-parameter instruction-tuned model, TextDavinci002. LMentry complements contemporary evaluation approaches of large language models, providing a quick, automatic, and easy-to-run "unit test", without resorting to large benchmark suites of complex tasks.
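To make the "automatic, easy-to-run unit test" framing concrete, the following is a minimal illustrative sketch in Python of how outputs for two of the task types named in the abstract could be scored programmatically. The function names and matching rules are assumptions made for illustration only; they are not taken from the LMentry implementation.

import re


def sentence_contains_word(sentence: str, target_word: str) -> bool:
    # Whole-word, case-insensitive check that the model's sentence contains the target word.
    pattern = r"\b" + re.escape(target_word) + r"\b"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None


def picked_longer_word(answer: str, word_a: str, word_b: str) -> bool:
    # Accept the answer only if it names the longer of the two candidate words.
    longer = word_a if len(word_a) > len(word_b) else word_b
    return answer.strip().strip(".").lower() == longer.lower()


# Hypothetical model outputs for the two tasks:
print(sentence_contains_word("The cat slept on the windowsill all afternoon.", "windowsill"))  # True
print(picked_longer_word("windowsill", "cat", "windowsill"))  # True

Checks of this kind can be run automatically over model outputs, which is what allows the benchmark to serve as a quick, interpretable probe rather than a large suite of complex tasks.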