Paper Title

PyMT5: multi-mode translation of natural language and Python code with transformers

Paper Authors

Clement, Colin B., Drain, Dawn, Timcheck, Jonathan, Svyatkovskiy, Alexey, Sundaresan, Neel

Paper Abstract

Simultaneously modeling source code and natural language has many exciting applications in automated software development and understanding. Pursuant to achieving such technology, we introduce PyMT5, the Python method text-to-text transfer transformer, which is trained to translate between all pairs of Python method feature combinations: a single model that can both predict whole methods from natural language documentation strings (docstrings) and summarize code into docstrings of any common style. We present an analysis and modeling effort of a large-scale parallel corpus of 26 million Python methods and 7.7 million method-docstring pairs, demonstrating that for docstring and method generation, PyMT5 outperforms similarly-sized auto-regressive language models (GPT2) which were English pre-trained or randomly initialized. On the CodeSearchNet test set, our best model predicts 92.1% syntactically correct method bodies, achieved a BLEU score of 8.59 for method generation and 16.3 for docstring generation (summarization), and achieved a ROUGE-L F-score of 24.8 for method generation and 36.7 for docstring generation.
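To make the "single model, multiple translation directions" idea concrete, below is a minimal sketch of a text-to-text setup using the Hugging Face transformers seq2seq API. It is an illustration under stated assumptions, not the paper's implementation: "t5-small" is a stand-in checkpoint (PyMT5's weights are not assumed to be publicly released), and the "summarize:" prefix is only a generic example of the target-style conditioning the abstract describes.

# A minimal sketch, assuming a T5-style seq2seq checkpoint as a stand-in
# for PyMT5; the control-prefix format here is illustrative only.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

method_source = (
    "def count_lines(path):\n"
    "    with open(path) as f:\n"
    "        return sum(1 for _ in f)\n"
)

# One model, many translation directions: a task prefix tells the model
# which target feature to produce (here, a docstring-style summary of
# the method; the reverse direction would condition on the docstring).
inputs = tokenizer("summarize: " + method_source, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=48, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

The same mechanism covers docstring-to-method generation by swapping the prefix and the input/target roles, which is how a single text-to-text model can serve both directions the abstract mentions.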
