Paper Title
Practical Program Repair in the Era of Large Pre-trained Language Models
Paper Authors
Paper Abstract
Automated Program Repair (APR) aims to help developers automatically patch software bugs. However, current state-of-the-art traditional and learning-based APR techniques face the problem of limited patch variety, failing to fix complicated bugs. This is mainly due to their reliance on bug-fixing datasets to craft fix templates or to directly predict potential patches. Large Pre-Trained Language Models (PLMs), trained on billions of text/code tokens, can potentially help avoid this issue. Very recently, researchers have directly leveraged PLMs for APR without relying on any bug-fixing datasets. However, such existing work either failed to include state-of-the-art PLMs or was not evaluated on realistic datasets. In this work, we perform the first extensive study on directly applying PLMs for APR. We select 9 recent state-of-the-art PLMs, including both generative and infilling models, ranging from 125M to 20B parameters. We design 3 different repair settings to evaluate the different ways of using PLMs to generate patches. We apply the PLMs under these repair settings on 5 datasets across 3 different languages and compare the PLMs on the number of bugs fixed, generation speed, and compilation rate. Our study demonstrates that directly applying state-of-the-art PLMs can already substantially outperform all existing APR techniques on all our datasets. Among the studied PLMs, a scaling effect exists for APR, where larger models tend to achieve better performance. Also, we show for the first time that the suffix code after the buggy line (adopted in infilling-style APR) is important for generating not only more fixes but also patches with a higher compilation rate. Beyond patch generation, the PLMs consider correct patches to be more natural than other ones, and can even be leveraged for effective patch ranking or patch correctness checking.
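To make the infilling setting concrete, the sketch below builds a cloze-style repair prompt by replacing the buggy line with a mask token while keeping both the prefix and the suffix code as context, so an infilling PLM can predict a replacement line. The `MASK` sentinel and helper name here are hypothetical illustrations, not the paper's implementation; real infilling models use their own model-specific sentinel tokens.

```python
# Hypothetical sketch of infilling-style APR prompt construction.
# The buggy line is masked out; prefix and suffix code are preserved
# so an infilling PLM can condition on both sides of the hole.

MASK = "<MASK>"  # placeholder; actual models define their own sentinel tokens

def build_infilling_prompt(source: str, buggy_line_no: int) -> str:
    """Replace the 1-indexed buggy line with a mask token."""
    lines = source.splitlines()
    lines[buggy_line_no - 1] = MASK
    return "\n".join(lines)

buggy_program = """def is_even(n):
    return n % 2 == 1  # bug: wrong remainder check
"""
prompt = build_infilling_prompt(buggy_program, 2)
print(prompt)
```

The suffix context retained after the mask is exactly what the study finds valuable: it constrains the model toward completions that fit the surrounding code, improving both fix rate and compilation rate compared with purely left-to-right generation.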