OCoR: An Overlapping-Aware Code Retriever

Paper Title

OCoR: An Overlapping-Aware Code Retriever

Authors

Qihao Zhu, Zeyu Sun, Xiran Liang, Yingfei Xiong, Lu Zhang

Abstract

Code retrieval helps developers reuse code snippets from open-source projects. Given a natural language description, code retrieval aims to find the most relevant code among a set of candidate snippets. Existing state-of-the-art approaches apply neural networks to code retrieval. However, these approaches still fail to capture an important feature: overlaps. Overlaps between the different names that different people use indicate that two names may be related (e.g., "message" and "msg"), and overlaps between identifiers in code and words in a natural language description indicate that the code snippet and the description may be related. To address this, we propose a novel neural architecture named OCoR, which introduces two specifically designed components to capture overlaps: the first embeds identifiers character by character to capture the overlaps between identifiers, and the second introduces a novel overlap matrix that represents the degree of overlap between each natural language word and each identifier. The evaluation was conducted on two established datasets. The experimental results show that OCoR significantly outperforms the existing state-of-the-art approaches, with improvements of 13.1% to 22.3%. Moreover, we conducted several in-depth experiments to help understand the contribution of the different components of OCoR.
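
To make the idea of an overlap matrix concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the degree of overlap is measured, so this sketch assumes a longest-common-subsequence score normalized by the length of the shorter string. The function names `lcs_length`, `overlap_score`, and `overlap_matrix` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def lcs_length(a: str, b: str) -> int:
    """Return the length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def overlap_score(word: str, identifier: str) -> float:
    """Assumed overlap measure: LCS length divided by the shorter length."""
    word, identifier = word.lower(), identifier.lower()
    shorter = min(len(word), len(identifier))
    if shorter == 0:
        return 0.0
    return lcs_length(word, identifier) / shorter


def overlap_matrix(words: Sequence[str],
                   identifiers: Sequence[str]) -> List[List[float]]:
    """Rows are description words, columns are code identifiers."""
    return [[overlap_score(w, ident) for ident in identifiers] for w in words]


if __name__ == "__main__":
    description = ["send", "message"]
    code_identifiers = ["msg", "count"]
    for word, row in zip(description, overlap_matrix(description, code_identifiers)):
        print(word, [round(score, 2) for score in row])
```

Under this assumed measure, "message" and "msg" receive the maximum score because every character of "msg" appears in order within "message", which is the kind of relation the abstract describes. In a neural retriever, a matrix like this would presumably be passed to the model as an additional input feature rather than used as a final ranking score.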
