Paper Title

Evaluating the Impact of Source Code Parsers on ML4SE Models

Paper Authors

Ilya Utkin, Egor Spirin, Egor Bogomolov, Timofey Bryksin

Paper Abstract

As researchers and practitioners apply Machine Learning to increasingly more software engineering problems, the approaches they use become more sophisticated. A lot of modern approaches utilize internal code structure in the form of an abstract syntax tree (AST) or its extensions: path-based representation, complex graph combining AST with additional edges. Even though the process of extracting ASTs from code can be done with different parsers, the impact of choosing a parser on the final model quality remains unstudied. Moreover, researchers often omit the exact details of extracting particular code representations. In this work, we evaluate two models, namely Code2Seq and TreeLSTM, in the method name prediction task backed by eight different parsers for the Java language. To unify the process of data preparation with different parsers, we develop SuperParser, a multi-language parser-agnostic library based on PathMiner. SuperParser facilitates the end-to-end creation of datasets suitable for training and evaluation of ML models that work with structural information from source code. Our results demonstrate that trees built by different parsers vary in their structure and content. We then analyze how this diversity affects the models' quality and show that the quality gap between the most and least suitable parsers for both models turns out to be significant. Finally, we discuss other features of the parsers that researchers and practitioners should take into account when selecting a parser along with the impact on the models' quality. The code of SuperParser is publicly available at https://doi.org/10.5281/zenodo.6366591. We also publish Java-norm, the dataset we use to evaluate the models: https://doi.org/10.5281/zenodo.6366599.
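
The abstract describes extracting ASTs from Java source with different parsers before training models such as Code2Seq and TreeLSTM. As a minimal illustration only (this is not the paper's SuperParser pipeline, and the choice of JavaParser here is an assumption for the sketch, requiring javaparser-core on the classpath and Java 11+), the following snippet parses a small class and prints its AST; running the same source through a different parser generally yields a tree with different node types, granularity, and labels, which is the kind of variation the paper measures.

// Minimal sketch: parse a Java snippet with JavaParser and print its AST.
// JavaParser stands in here for any of the parsers compared in the paper;
// another parser would typically produce a differently shaped tree for the
// same source code.
import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.Node;

public class AstDump {
    // Recursively print each node's type, indented by its depth in the tree.
    static void dump(Node node, int depth) {
        System.out.println("  ".repeat(depth) + node.getClass().getSimpleName());
        for (Node child : node.getChildNodes()) {
            dump(child, depth + 1);
        }
    }

    public static void main(String[] args) {
        String source = "class A { int add(int x, int y) { return x + y; } }";
        CompilationUnit unit = StaticJavaParser.parse(source);
        dump(unit, 0);
    }
}

The output is an indented list of node type names (CompilationUnit, ClassOrInterfaceDeclaration, MethodDeclaration, and so on); comparing such dumps across parsers makes the structural differences discussed in the abstract concrete.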
