Paper Title

CodeBERT-nt: code naturalness via CodeBERT

Paper Authors

Ahmed Khanfir, Matthieu Jimenez, Mike Papadakis, Yves Le Traon

Paper Abstract

Much of software-engineering research relies on the naturalness of code, i.e., the fact that code, in small snippets, is repetitive and can be predicted using statistical language models such as n-grams. Although powerful, training such models on a large code corpus is tedious, time-consuming and sensitive to the code patterns (and practices) encountered during training. Consequently, these models are often trained on small corpora and estimate a language naturalness that is relative to a specific style of programming or type of project. To overcome these issues, we propose using pre-trained language models to infer code naturalness. Pre-trained models are often built on big data, are easy to use out of the box and include powerful learning-association mechanisms. Our key idea is to quantify code naturalness through its predictability, by using state-of-the-art generative pre-trained language models. To this end, we infer naturalness by masking (omitting) code tokens of code sequences, one at a time, and checking the model's ability to predict them. We evaluate three different predictability metrics: a) measuring the number of exact matches of the predictions, b) computing the embedding similarity between the original and predicted code, i.e., similarity in the vector space, and c) computing the confidence of the model when performing the token-completion task, irrespective of the outcome. We implement this workflow, named CodeBERT-nt, and evaluate its capability to prioritize buggy lines over non-buggy ones when ranking code based on its naturalness. Our results, on 2510 buggy versions of 40 projects from the SmartShark dataset, show that CodeBERT-nt outperforms both random-uniform and complexity-based ranking techniques, and yields results comparable to, and slightly better than, n-gram models.
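
The Python sketch below illustrates the token-masking workflow the abstract describes: mask each token of a code line, ask a masked language model to predict it, and aggregate per-token predictability into a line-level score covering metrics (a) exact match and (c) model confidence; metric (b), embedding similarity, is omitted for brevity. This is a minimal sketch, not the authors' CodeBERT-nt implementation: the choice of the publicly available microsoft/codebert-base-mlm checkpoint, the helper name line_naturalness, and the simple averaging of per-token scores are illustrative assumptions.

# Minimal sketch of per-line naturalness scoring via token masking.
# Assumptions (not from the paper): checkpoint choice, helper name, score averaging.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base-mlm")
model = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")
model.eval()

def line_naturalness(code_line: str) -> dict:
    """Mask each token of a code line, one at a time, and record how well
    the model recovers it (metrics (a) and (c) from the abstract)."""
    input_ids = tokenizer(code_line, return_tensors="pt")["input_ids"][0]
    exact_matches, confidences = [], []
    # Positions 0 and -1 hold the <s> / </s> special tokens; skip them.
    for pos in range(1, input_ids.size(0) - 1):
        masked = input_ids.clone()
        original_id = masked[pos].item()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        probs = torch.softmax(logits, dim=-1)
        predicted_id = int(probs.argmax())
        exact_matches.append(predicted_id == original_id)   # (a) exact match
        confidences.append(probs[predicted_id].item())      # (c) model confidence
    return {
        "exact_match_rate": sum(exact_matches) / len(exact_matches),
        "mean_confidence": sum(confidences) / len(confidences),
    }

# Example: an idiomatic line should generally score as more "natural"
# (higher predictability) than an unusual one.
print(line_naturalness("for (int i = 0; i < n; i++) {"))

Under this setup, ranking a file's lines by ascending naturalness score would surface the least natural (most surprising) lines first, which is the prioritization use case the abstract evaluates.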
