Paper Title
CommitBART: A Large Pre-trained Model for GitHub Commits
Paper Authors
Paper Abstract
GitHub commits, which record code changes together with natural language messages describing them, play a critical role in helping software developers comprehend software evolution. To promote the development of the open-source software community, we collect a commit benchmark containing over 7.99 million commits across 7 programming languages. Based on this benchmark, we present CommitBART, a large pre-trained encoder-decoder Transformer model for GitHub commits. The model is pre-trained with six pre-training tasks in three categories (i.e., denoising objectives, cross-modal generation, and contrastive learning) to learn commit fragment representations. Furthermore, we unify a ``commit intelligence'' framework with one understanding task and three generation tasks for commits. Comprehensive experiments on these tasks demonstrate that CommitBART significantly outperforms previous pre-trained models for code. Further analysis also reveals that each pre-training task enhances the model performance.
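Since the abstract describes CommitBART as an encoder-decoder Transformer applied to generation tasks for commits (e.g., producing a natural language message from a code change), the sketch below illustrates what such inference could look like with the Hugging Face Transformers API. This is a minimal sketch under stated assumptions: the paper does not specify a public checkpoint here, so the real "facebook/bart-base" checkpoint stands in for a CommitBART checkpoint, and the diff text and generation parameters are illustrative, not the paper's actual setup.

```python
# Minimal sketch: commit-message generation with a BART-style
# encoder-decoder via Hugging Face Transformers.
# NOTE: "facebook/bart-base" is a stand-in; the actual CommitBART
# checkpoint name is not given in this abstract.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# A commit pairs a code change with a natural language message; here the
# diff is the encoder input and the message is decoded from it.
diff = (
    "- def add(a, b): return a - b\n"
    "+ def add(a, b): return a + b"
)

inputs = tokenizer(diff, return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(**inputs, max_length=32, num_beams=4)
message = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(message)  # a short generated description of the code change
```

The same encoder-decoder interface would plausibly cover the other generation tasks the abstract mentions, by swapping which side of the commit (code change or message) serves as input and which is generated.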