Paper Title

Benchmarking Language Models for Code Syntax Understanding

Paper Authors

Da Shen, Xinyun Chen, Chenguang Wang, Koushik Sen, Dawn Song

Paper Abstract

Pre-trained language models have demonstrated impressive performance in both natural language processing and program understanding, representing the input as a token sequence without explicitly modeling its structure. Some prior works show that pre-trained language models can capture the syntactic rules of natural languages without fine-tuning on syntax understanding tasks. However, there is so far limited understanding of how well pre-trained models capture code structure. In this work, we perform the first thorough benchmarking of state-of-the-art pre-trained models for identifying the syntactic structures of programs. Specifically, we introduce CodeSyntax, a large-scale dataset of programs annotated with the syntactic relationships in their corresponding abstract syntax trees. Our key observation is that existing language models pre-trained on code still lack understanding of code syntax. In fact, these pre-trained programming language models fail to match the performance of simple baselines based on positional offsets and keywords. We also present a natural language benchmark to highlight the differences between natural languages and programming languages in terms of syntactic structure understanding. Our findings point out key limitations of existing pre-training methods for programming languages, and suggest the importance of modeling code syntactic structures.
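
To make the notion of "syntactic relationships in an abstract syntax tree" concrete, below is a minimal illustrative sketch using Python's built-in ast module. It extracts a few head-dependent node pairs of the kind a dataset like CodeSyntax annotates; the relation labels (e.g. "If:body") and the extraction logic here are hypothetical choices for demonstration, not the paper's actual dataset-construction code.

# Minimal sketch (not the paper's pipeline): extract syntactic relations
# between AST nodes of a Python program. Relation names are illustrative.
import ast

source = """\
if x > 0:
    y = x + 1
"""

tree = ast.parse(source)

# Collect (relation, head_node, dependent_node) triples, e.g. the link
# from an `if` statement to its condition and to the first body statement.
relations = []
for node in ast.walk(tree):
    if isinstance(node, ast.If):
        relations.append(("If:test", node, node.test))
        relations.append(("If:body", node, node.body[0]))
    elif isinstance(node, ast.Assign):
        relations.append(("Assign:value", node, node.value))

for name, head, dep in relations:
    # AST nodes carry line/column offsets, which map back to source tokens,
    # so each relation can be grounded as a pair of token positions.
    print(name, (head.lineno, head.col_offset), (dep.lineno, dep.col_offset))

A positional-offset baseline of the sort the abstract mentions would, by contrast, simply predict that the dependent token sits a fixed distance after the head token, with no parsing at all.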
