Paper Title

DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation

Paper Authors

Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Scott Wen-tau Yih, Daniel Fried, Sida Wang, Tao Yu

Paper Abstract

We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior work, DS-1000 incorporates three core features. First, our problems reflect diverse, realistic, and practical use cases, since we collected them from StackOverflow. Second, our automatic evaluation is highly specific (reliable) -- across all Codex-002-predicted solutions that our evaluation accepts, only 1.8% are incorrect; we achieve this with multi-criteria metrics, checking both functional correctness, by running test cases, and surface-form constraints, by restricting API usage or keywords. Finally, we proactively defend against memorization by slightly modifying our problems so that they differ from the original StackOverflow sources; consequently, models cannot answer them correctly by memorizing the solutions from pre-training. The current best public system (Codex-002) achieves 43.3% accuracy, leaving ample room for improvement. We release our benchmark at https://ds1000-code-gen.github.io.
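To make the multi-criteria evaluation concrete, below is a minimal sketch of how such a check might be structured. The function name `multi_criteria_eval`, its arguments, and the regex-based keyword check are illustrative assumptions, not the benchmark's released code; the actual DS-1000 harness ships per-problem test programs and sandboxes execution, which this sketch does not. The usage example assumes NumPy is installed.

```python
import re

def multi_criteria_eval(generated_code: str, test_program: str,
                        banned_patterns: list[str]) -> bool:
    """Hypothetical sketch: a predicted solution passes only if it
    (1) satisfies surface-form constraints and (2) runs correctly
    against the problem's test cases."""
    # Surface-form constraint: reject solutions that use a banned
    # API or keyword (e.g., a problem may require a vectorized
    # answer and so forbid explicit loops).
    for pattern in banned_patterns:
        if re.search(pattern, generated_code):
            return False
    # Functional correctness: execute the candidate together with
    # the test program; any exception or failed assertion counts
    # as incorrect. (The real benchmark isolates this execution.)
    try:
        exec(generated_code + "\n" + test_program, {})
    except Exception:
        return False
    return True

# Illustrative usage: a problem that forbids explicit `for` loops.
candidate = "import numpy as np\nresult = np.arange(5) ** 2"
tests = "assert list(result) == [0, 1, 4, 9, 16]"
print(multi_criteria_eval(candidate, tests, [r"\bfor\b"]))  # True
```

Combining both criteria is what makes the metric "specific" in the abstract's sense: a solution that produces the right output through a disallowed API is still rejected, which reduces the rate of false positives among accepted predictions.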
