Paper Title

Automatically Recommend Code Updates: Are We There Yet?

Authors

Yue Liu, Chakkrit Tantithamthavorn, Yonghui Liu, Patanamon Thongtanunam, Li Li

Abstract

In recent years, large pre-trained Language Models of Code (CodeLMs) have shown promising results on various software engineering tasks. One such task is automatic code update recommendation, which transforms outdated code snippets into their approved and revised counterparts. Although many CodeLM-based approaches have been proposed, claiming high accuracy, their effectiveness and reliability on real-world code update tasks remain questionable. In this paper, we present the first extensive evaluation of state-of-the-art CodeLMs for automatically recommending code updates. We assess their performance on two diverse datasets of paired updated methods, considering factors such as temporal evolution, project specificity, method size, and update complexity. Our results reveal that while CodeLMs perform well in settings that ignore temporal information, they struggle in more realistic time-wise scenarios and generalize poorly to new projects. CodeLM performance also decreases significantly for larger methods and more complex updates. Furthermore, we observe that many CodeLM-generated "updates" are actually null, especially in time-wise settings, and meaningful edits remain challenging. Our findings highlight the significant gap between the perceived and actual effectiveness of CodeLMs for real-world code update recommendation and emphasize the need for more research on improving their practicality, robustness, and generalizability.
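The abstract's distinction between settings that "ignore temporal information" and "time-wise scenarios" refers to how the evaluation data is split. A minimal sketch of the idea, with an entirely hypothetical data schema (field names and dates are illustrative assumptions, not the paper's actual datasets):

```python
from datetime import datetime

# Hypothetical code-update samples; "pair" holds (outdated, updated) method text.
samples = [
    {"project": "libA", "committed": datetime(2019, 3, 1), "pair": ("old1", "new1")},
    {"project": "libA", "committed": datetime(2020, 6, 1), "pair": ("old2", "new2")},
    {"project": "libB", "committed": datetime(2021, 1, 1), "pair": ("old3", "new3")},
    {"project": "libB", "committed": datetime(2022, 9, 1), "pair": ("old4", "new4")},
]

def time_wise_split(samples, cutoff):
    """Train only on updates committed before the cutoff and test on later
    ones, so the model never sees 'future' code at training time.
    A random split, by contrast, can leak later updates into training."""
    train = [s for s in samples if s["committed"] < cutoff]
    test = [s for s in samples if s["committed"] >= cutoff]
    return train, test

train, test = time_wise_split(samples, datetime(2021, 1, 1))
print(len(train), len(test))  # 2 2
```

A cross-project variant of the same idea would instead hold out entire projects (e.g., all `libB` samples) for testing, which is what the abstract's "generalize poorly to new projects" finding evaluates.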
