Paper Title
Improving the Robustness to Data Inconsistency between Training and Testing for Code Completion by Hierarchical Language Model
Paper Authors
Paper Abstract
In the field of software engineering, applying language models to the token sequence of source code is the state-of-the-art approach to building a code recommendation system. The syntax tree of source code has a hierarchical structure, and ignoring this tree structure degrades model performance. Current LSTM models handle only sequential data, and their performance drops sharply when noisy unseen data is distributed throughout the test suite. Because code has free naming conventions, it is common for a model trained on one project to encounter many unknown words on another project. If we mark many unseen words as UNK, as is done in natural language processing, the number of UNK tokens becomes much larger than the count of the most frequent words; in an extreme case, simply predicting UNK everywhere can achieve very high prediction accuracy. Thus, such a solution cannot reflect the true performance of a model on noisy unseen data. In this paper, we mark only a small number of rare words as UNK and report the prediction performance of models under in-project and cross-project evaluation. We propose a novel Hierarchical Language Model (HLM) to improve the robustness of the LSTM model so that it can handle the inconsistency of data distribution between training and testing. The proposed HLM takes the hierarchical structure of the code tree into consideration when predicting code: it uses a BiLSTM to generate embeddings for sub-trees according to the hierarchy, and it aggregates the embeddings of the sub-trees in context to predict the next code token. Experiments on in-project and cross-project data sets indicate that the proposed HLM performs better than the state-of-the-art LSTM model in dealing with the data inconsistency between training and testing, achieving an average improvement of 11.2\% in prediction accuracy.
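The following is a minimal sketch, not the authors' implementation, of the HLM idea summarized in the abstract: a BiLSTM encodes the children of each sub-tree into a single sub-tree embedding, and a context LSTM over the resulting sequence of sub-tree embeddings predicts the next code token. The PyTorch setting, class and method names, and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalLanguageModelSketch(nn.Module):
    """Illustrative sketch of a hierarchical language model for code completion."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        # BiLSTM that turns a sequence of child embeddings into one sub-tree embedding
        self.subtree_encoder = nn.LSTM(embed_dim, hidden_dim // 2,
                                       batch_first=True, bidirectional=True)
        # LSTM over the sub-tree embeddings that appear in the current context
        self.context_lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def encode_subtree(self, child_token_ids):
        # child_token_ids: (batch, num_children) token ids of one sub-tree's children
        child_emb = self.token_embedding(child_token_ids)
        _, (h_n, _) = self.subtree_encoder(child_emb)
        # concatenate the final forward and backward hidden states
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # (batch, hidden_dim)

    def forward(self, subtrees):
        # subtrees: list of (batch, num_children) tensors, one per sub-tree in context
        subtree_embs = torch.stack([self.encode_subtree(t) for t in subtrees], dim=1)
        context_out, _ = self.context_lstm(subtree_embs)
        # logits over the vocabulary for the next code token
        return self.output(context_out[:, -1, :])
```

In this sketch, rare tokens would be mapped to a single UNK id before embedding, mirroring the paper's evaluation setup in which only a small number of rare words are marked as UNK.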