Paper Title
UniXcoder: Unified Cross-Modal Pre-training for Code Representation
Paper Authors
Paper Abstract
Pre-trained models for programming languages have recently demonstrated great success on code intelligence. To support both code-related understanding and generation tasks, recent works attempt to pre-train unified encoder-decoder models. However, such an encoder-decoder framework is sub-optimal for auto-regressive tasks, especially code completion, which requires a decoder-only manner for efficient inference. In this paper, we present UniXcoder, a unified cross-modal pre-trained model for programming languages. The model utilizes mask attention matrices with prefix adapters to control its behavior and leverages cross-modal contents such as ASTs and code comments to enhance code representation. To encode an AST, which is naturally represented as a tree, in parallel, we propose a one-to-one mapping method that transforms the AST into a sequence structure retaining all structural information from the tree. Furthermore, we propose to utilize multi-modal contents to learn representations of code fragments with contrastive learning, and then align representations among programming languages using a cross-modal generation task. We evaluate UniXcoder on five code-related tasks over nine datasets. To further evaluate the quality of code fragment representations, we also construct a dataset for a new task, called zero-shot code-to-code search. Results show that our model achieves state-of-the-art performance on most tasks, and analysis reveals that comments and ASTs both enhance UniXcoder.
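The two mechanisms named in the abstract, the mask attention matrix that switches a single model between encoder-only, decoder-only, and encoder-decoder behavior, and the one-to-one flattening of an AST into a sequence, can be sketched roughly as below. This is a minimal illustration under stated assumptions: the function names, the node encoding, and the `<name,left>`/`<name,right>` marker tokens are illustrative choices, not the paper's actual implementation.

```python
def attention_mask(mode, n, n_src=0):
    """Build an n x n mask where entry [i][j] == 1 means token i may
    attend to token j. One mask per behavior of the same model:
      "encoder" - full bidirectional attention
      "decoder" - causal: each token sees itself and earlier tokens
      "enc-dec" - the first n_src tokens (the source prefix) attend
                  bidirectionally among themselves; the rest are causal
    """
    if mode == "encoder":
        return [[1] * n for _ in range(n)]
    if mode == "decoder":
        return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
    if mode == "enc-dec":
        return [[1 if j <= i or (i < n_src and j < n_src) else 0
                 for j in range(n)] for i in range(n)]
    raise ValueError(f"unknown mode: {mode}")


def flatten_ast(node):
    """Flatten a tree into a token sequence, wrapping each subtree in
    paired <name,left> / <name,right> markers so the original tree can
    be reconstructed from the sequence (a one-to-one mapping).
    A node is a (name, children) tuple; a leaf has an empty child list.
    """
    name, children = node
    if not children:
        return [name]
    return ([f"<{name},left>"]
            + [tok for child in children for tok in flatten_ast(child)]
            + [f"<{name},right>"])
```

Because the mode is expressed entirely in the mask (plus, in the paper, a prefix adapter prepended to the input), the same Transformer weights can serve understanding and generation tasks without separate encoder and decoder stacks, and the flattened AST can be fed to it like any other token sequence.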