Paper Title
Evaluating Commit Message Generation: To BLEU Or Not To BLEU?
Paper Authors
Paper Abstract
Commit messages play an important role in several software engineering tasks, such as program comprehension and understanding program evolution. However, programmers often neglect to write good commit messages. Hence, several Commit Message Generation (CMG) tools have been proposed. We observe that recent state-of-the-art CMG tools use simple, easy-to-compute automated evaluation metrics such as BLEU4 or its variants. Advances in the field of Machine Translation (MT) have exposed several weaknesses of BLEU4 and its variants, and have produced several other metrics for evaluating Natural Language Generation (NLG) tools. In this work, we discuss the suitability of various MT metrics for the CMG task. Based on the insights from our experiments, we propose a new metric variant specifically for evaluating the CMG task, and we re-evaluate the state-of-the-art CMG tools on this new metric. We believe that our work fills an important gap in the understanding of evaluation metrics for CMG research.
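To make the metric under discussion concrete, below is a minimal sketch of sentence-level BLEU-4 as it might be applied to a generated commit message against a reference one. This is a generic textbook formulation (clipped n-gram precisions, geometric mean, brevity penalty), not the exact variant used by the paper or by any particular CMG tool; the add-one smoothing and the example messages are illustrative assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(reference, hypothesis):
    """Sentence-level BLEU-4: geometric mean of clipped 1..4-gram
    precisions (with add-one smoothing, an assumption here) times a
    brevity penalty that punishes overly short hypotheses."""
    ref, hyp = reference.split(), hypothesis.split()
    log_prec_sum = 0.0
    for n in range(1, 5):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped matches: a hypothesis n-gram counts at most as
        # often as it appears in the reference.
        match = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(len(hyp) - n + 1, 0)
        # Add-one smoothing so one missing n-gram order does not
        # zero out the whole score.
        log_prec_sum += math.log((match + 1) / (total + 1))
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_prec_sum / 4)

# Illustrative commit messages (hypothetical):
print(bleu4("fix null pointer check in parser",
            "fix parser bug"))
```

An identical hypothesis scores 1.0, while a short paraphrase like the one above scores much lower even though a human might judge it adequate; this gap is one reason sentence-level BLEU-4 is contested as a CMG metric.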