Paper Title

DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation

Paper Authors

Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Scott Wen-tau Yih, Daniel Fried, Sida Wang, Tao Yu

Paper Abstract

We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior work, DS-1000 incorporates three core features. First, our problems reflect diverse, realistic, and practical use cases, since we collected them from StackOverflow. Second, our automatic evaluation is highly specific (reliable) -- across all Codex-002-predicted solutions that our evaluation accepts, only 1.8% are incorrect; we achieve this with multi-criteria metrics, checking both functional correctness, by running test cases, and surface-form constraints, by restricting API usage or keywords. Finally, we proactively defend against memorization by slightly modifying our problems so that they differ from the original StackOverflow sources; consequently, models cannot answer them correctly by memorizing the solutions from pre-training. The current best public system (Codex-002) achieves 43.3% accuracy, leaving ample room for improvement. We release our benchmark at https://ds1000-code-gen.github.io.
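To make the multi-criteria evaluation concrete, below is a minimal sketch of how such a check might be structured. The function name `multi_criteria_eval`, its arguments, and the regex-based keyword check are illustrative assumptions, not the benchmark's released code; the actual DS-1000 harness ships per-problem test programs and sandboxes execution, which this sketch does not. The usage example assumes NumPy is installed.

```python
import re

def multi_criteria_eval(generated_code: str, test_program: str,
                        banned_patterns: list[str]) -> bool:
    """Hypothetical sketch: a predicted solution passes only if it
    (1) satisfies surface-form constraints and (2) runs correctly
    against the problem's test cases."""
    # Surface-form constraint: reject solutions that use a banned
    # API or keyword (e.g., a problem may require a vectorized
    # answer and so forbid explicit loops).
    for pattern in banned_patterns:
        if re.search(pattern, generated_code):
            return False
    # Functional correctness: execute the candidate together with
    # the test program; any exception or failed assertion counts
    # as incorrect. (The real benchmark isolates this execution.)
    try:
        exec(generated_code + "\n" + test_program, {})
    except Exception:
        return False
    return True

# Illustrative usage: a problem that forbids explicit `for` loops.
candidate = "import numpy as np\nresult = np.arange(5) ** 2"
tests = "assert list(result) == [0, 1, 4, 9, 16]"
print(multi_criteria_eval(candidate, tests, [r"\bfor\b"]))  # True
```

Combining both criteria is what makes the metric "specific" in the abstract's sense: a solution that produces the right output through a disallowed API is still rejected, which reduces the rate of false positives among accepted predictions.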
