Paper Title
FixEval: Execution-based Evaluation of Program Fixes for Programming Problems
Paper Authors
Paper Abstract
The complexity of modern software has led to a drastic increase in the time and cost associated with detecting and rectifying software bugs. In response, researchers have explored various methods to automatically generate fixes for buggy code. However, due to the large combinatorial space of possible fixes for any given bug, few tools and datasets are available to evaluate model-generated fixes effectively. To address this issue, we introduce FixEval, a benchmark comprising buggy code submissions to competitive programming problems and their corresponding fixes. FixEval offers an extensive collection of unit tests to evaluate the correctness of model-generated program fixes, along with further information regarding time and memory constraints and acceptance based on a verdict. We consider two Transformer language models pretrained on programming languages as our baselines and compare them using match-based and execution-based evaluation metrics. Our experiments show that match-based metrics do not accurately reflect the quality of model-generated program fixes. In contrast, execution-based methods evaluate each program against all test cases and scenarios designed explicitly for that problem. Therefore, we believe FixEval provides a step towards real-world automatic bug fixing and model-generated code evaluation. The dataset and models are open-sourced at https://github.com/mahimanzum/FixEval.
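To make the contrast between match-based and execution-based evaluation concrete, below is a minimal, illustrative sketch in Python, not the paper's actual evaluation harness. The function names, verdict labels, toy "A + B" problem, and the two-second time limit are assumptions chosen for clarity; it simply runs a candidate fix as a subprocess against a handful of unit tests and returns a judge-style verdict.

```python
# Illustrative sketch of match-based vs. execution-based evaluation of a program fix.
# NOTE: the helpers, verdict strings, and toy problem below are assumptions for
# demonstration only; they are not FixEval's actual judging code.
import subprocess
import sys


def exact_match(candidate_fix: str, reference_fix: str) -> bool:
    """Match-based metric: string equality against a single reference fix."""
    return candidate_fix.strip() == reference_fix.strip()


def run_against_tests(source_code: str, unit_tests, time_limit_s: float = 2.0) -> str:
    """Execution-based metric: run the candidate program on every unit test
    and return a single verdict, in the style of an online judge."""
    for stdin_data, expected_stdout in unit_tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", source_code],
                input=stdin_data,
                capture_output=True,
                text=True,
                timeout=time_limit_s,
            )
        except subprocess.TimeoutExpired:
            return "Time Limit Exceeded"
        if result.returncode != 0:
            return "Runtime Error"
        if result.stdout.strip() != expected_stdout.strip():
            return "Wrong Answer"
    return "Accepted"


if __name__ == "__main__":
    # A semantically correct fix that differs textually from the reference fix.
    reference = "a, b = map(int, input().split())\nprint(a + b)"
    candidate = "x, y = map(int, input().split())\nprint(x + y)"
    tests = [("1 2", "3"), ("10 -4", "6")]
    print("exact match:", exact_match(candidate, reference))  # False
    print("verdict:", run_against_tests(candidate, tests))    # Accepted
```

The example shows the failure mode the abstract describes: the candidate fix uses different variable names, so a match-based metric scores it as incorrect, while executing it against the unit tests yields an "Accepted" verdict.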