Paper Title

Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics

Paper Authors

Nitika Mathur, Timothy Baldwin, Trevor Cohn

Abstract

Automatic metrics are fundamental for the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the gold standard of human evaluation is not a straightforward problem. We show that current methods for judging metrics are highly sensitive to the translations used for assessment, particularly the presence of outliers, which often leads to falsely confident conclusions about a metric's efficacy. Finally, we turn to pairwise system ranking, developing a method for thresholding performance improvement under an automatic metric against human judgements, which allows quantification of type I versus type II errors incurred, i.e., insignificant human differences in system quality that are accepted, and significant human differences that are rejected. Together, these findings suggest improvements to the protocols for metric evaluation and system performance evaluation in machine translation.
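The pairwise system ranking analysis described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the data and the `type_i_ii_errors` helper below are hypothetical, and only show how a threshold on metric score differences produces the Type I and Type II error counts defined above (improvements accepted despite an insignificant human difference, and improvements rejected despite a significant one).

```python
# Minimal sketch (not the paper's code) of thresholding metric score
# differences between pairs of MT systems against human judgements.
# All numbers below are hypothetical placeholders.
import numpy as np

def type_i_ii_errors(metric_deltas, human_significant, threshold):
    """metric_deltas: absolute metric score differences for system pairs.
    human_significant: True if the human-judged difference for that pair
    is statistically significant."""
    accepted = metric_deltas >= threshold            # metric claims a real improvement
    type_i = np.sum(accepted & ~human_significant)   # accepted, but humans see no real difference
    type_ii = np.sum(~accepted & human_significant)  # rejected, but humans see a real difference
    return type_i, type_ii

# Hypothetical system-pair data for illustration only.
metric_deltas = np.array([0.1, 0.4, 1.2, 2.5, 3.0, 0.7])
human_significant = np.array([False, False, True, True, True, False])

for threshold in [0.5, 1.0, 2.0]:
    t1, t2 = type_i_ii_errors(metric_deltas, human_significant, threshold)
    print(f"threshold={threshold}: Type I={t1}, Type II={t2}")
```

Sweeping the threshold in this way trades the two error types off against each other, which is the quantification the abstract refers to.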
