A Toolkit for Generating Code Knowledge Graphs

Paper title

A Toolkit for Generating Code Knowledge Graphs

Authors

Abdelaziz, Ibrahim; Dolby, Julian; McCusker, Jamie; Srinivas, Kavitha

Abstract

Knowledge graphs have proven extremely useful in powering diverse applications in semantic search and natural language understanding. In this paper, we present GraphGen4Code, a toolkit to build code knowledge graphs that can similarly power various applications such as program search, code understanding, bug detection, and code automation. GraphGen4Code uses generic techniques to capture code semantics, with the key nodes in the graph representing classes, functions, and methods. Edges indicate function usage (e.g., how data flows through function calls, as derived from program analysis of real code) and documentation about functions (e.g., code documentation, usage documentation, or forum discussions such as StackOverflow). Our toolkit uses named graphs in RDF to model graphs per program, or can output graphs as JSON. We show the scalability of the toolkit by applying it to 1.3 million Python files drawn from GitHub, 2,300 Python modules, and 47 million forum posts. This results in an integrated code graph with over 2 billion triples. We make publicly available both the toolkit for building such graphs and a sample extraction of the 2-billion-triple graph.
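
To make the "one named graph per program, or JSON" idea concrete, the following is a minimal sketch in plain Python. It is not GraphGen4Code's code or vocabulary: the `ProgramGraph` class, the `http://example.org/codegraph/` namespace, the predicate names (`calls`, `flowsTo`, `hasDoc`), and the documentation string are all hypothetical, chosen only to illustrate how function-usage edges and documentation edges for one program might be grouped under a single graph name and written out as N-Quads or JSON.

```python
import json
from dataclasses import dataclass, field

# Hypothetical namespace for illustration; not the toolkit's real vocabulary.
BASE = "http://example.org/codegraph/"


def escape_literal(text):
    """Escape backslashes, quotes, and newlines for an N-Quads string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


@dataclass
class ProgramGraph:
    """Edges extracted from one program, all stored under one graph name."""
    name: str
    triples: list = field(default_factory=list)

    def add(self, subject, predicate, obj, literal=False):
        # literal=True marks the object as text (e.g. documentation),
        # otherwise it is treated as another node in the graph.
        self.triples.append((subject, predicate, obj, literal))

    def to_nquads(self):
        """One line per edge; the fourth term names this program's graph."""
        graph_iri = f"<{BASE}program/{self.name}>"
        lines = []
        for s, p, o, literal in self.triples:
            obj = f'"{escape_literal(o)}"' if literal else f"<{BASE}{o}>"
            lines.append(f"<{BASE}{s}> <{BASE}{p}> {obj} {graph_iri} .")
        return "\n".join(lines)

    def to_json(self):
        """The same edges as a JSON document."""
        edges = [
            {"subject": s, "predicate": p, "object": o, "literal": literal}
            for s, p, o, literal in self.triples
        ]
        return json.dumps({"graph": self.name, "triples": edges}, indent=2)


# A toy program with two call sites, where the result of the first
# call flows into the second, plus one documentation edge.
graph = ProgramGraph("demo.py")
graph.add("call/1", "calls", "pandas.read_csv")
graph.add("call/2", "calls", "sklearn.svm.SVC.fit")
graph.add("call/1", "flowsTo", "call/2")
graph.add("pandas.read_csv", "hasDoc", "Placeholder documentation text.", literal=True)

print(graph.to_nquads())
print(graph.to_json())
```

Keeping the graph name as the fourth term of every statement is what allows edges from many programs to be loaded into one store while still being queried per program; the JSON form carries the same information for consumers that do not use RDF.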
