Paper Title

Code Search based on Context-aware Code Translation

Authors

Weisong Sun, Chunrong Fang, Yuchen Chen, Guanhong Tao, Tingxu Han, Quanjun Zhang

Abstract


Code search is a technique widely used by developers during software development. It provides developers with semantically similar implementations from a large code corpus based on their queries. Existing techniques leverage deep learning models to construct embedding representations for code snippets and queries, respectively. Features such as abstract syntax trees and control flow graphs are commonly employed to represent the semantics of code snippets. However, the same structure of these features does not necessarily denote the same semantics of code snippets, and vice versa. In addition, these techniques utilize multiple different word mapping functions that map query words/code tokens to embedding representations. This causes diverged embeddings of the same word/token in queries and code snippets. We propose a novel context-aware code translation technique that translates code snippets into natural language descriptions (called translations). The code translation is conducted on machine instructions, where the context information is collected by simulating the execution of the instructions. We further design a shared word mapping function that uses one single vocabulary to generate embeddings for both translations and queries. We evaluate the effectiveness of our technique, called TranCS, on the CodeSearchNet corpus with 1,000 queries. Experimental results show that TranCS significantly outperforms state-of-the-art techniques by 49.31% to 66.50% in terms of MRR (mean reciprocal rank).
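The abstract's shared word mapping idea can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual implementation: a single vocabulary backs one word-to-vector table, so a word occurring in a query and in a code translation receives the identical embedding instead of diverging across two separate mapping functions.

```python
import random

# Hypothetical sketch of a shared word mapping function: one vocabulary, one
# word->vector table, used for both translations (of code) and queries.
# Vectors here are random placeholders standing in for learned embeddings.
class SharedWordMapper:
    def __init__(self, vocab, dim=4, seed=0):
        rng = random.Random(seed)
        self.table = {w: [rng.gauss(0, 1) for _ in range(dim)] for w in vocab}

    def embed(self, words):
        return [self.table[w] for w in words]

mapper = SharedWordMapper(["sort", "array", "return"])
query_vecs = mapper.embed(["sort", "array"])        # query words
translation_vecs = mapper.embed(["return", "sort"])  # translation tokens
# "sort" maps to the same vector regardless of where it appears:
assert query_vecs[0] == translation_vecs[1]
```

With two separate mapping functions (as in the existing techniques the abstract criticizes), the two occurrences of "sort" would generally land on different vectors, weakening query/code matching.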
