Paper Title
Reproducibility Issues for BERT-based Evaluation Metrics
Paper Authors
Paper Abstract
Reproducibility is of utmost concern in machine learning and natural language processing (NLP). In the field of natural language generation (especially machine translation), the seminal paper of Post (2018) pointed out reproducibility problems with BLEU, the dominant metric at the time of publication. Nowadays, BERT-based evaluation metrics considerably outperform BLEU. In this paper, we ask whether results and claims from four recent BERT-based metrics can be reproduced. We find that reproduction of claims and results often fails because of (i) heavy undocumented preprocessing involved in the metrics, (ii) missing code, and (iii) weaker reported results for the baseline metrics. (iv) In one case, the problem stems from correlating not with human scores but with the wrong column of a CSV file, inflating scores by 5 points. Motivated by the impact of preprocessing, we then conduct a second study in which we examine its effects more closely (for one of the metrics). We find that preprocessing can have large effects, especially for highly inflectional languages. In this case, the effect of preprocessing may be larger than the effect of the aggregation mechanism (e.g., greedy alignment vs. Word Mover's Distance).
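The two aggregation mechanisms named at the end of the abstract can be illustrated with a toy sketch. This is not the paper's implementation: it uses made-up 2-dimensional "token embeddings" instead of BERT representations, and it approximates Word Mover's Distance for equal-length sentences with uniform weights by a brute-force minimum-cost one-to-one matching (real WMD solves a general optimal transport problem). Only the contrast between the two aggregation strategies is what matters here.

```python
import itertools
import math

def cos(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def greedy_alignment(cand, ref):
    # Greedy alignment (BERTScore-style): each candidate token is
    # matched to its most similar reference token; the maxima are
    # averaged into a single similarity score.
    return sum(max(cos(c, r) for r in ref) for c in cand) / len(cand)

def wmd_like(cand, ref):
    # Toy stand-in for Word Mover's Distance: with equal-length
    # sentences and uniform token weights, optimal transport reduces
    # to a minimum-cost one-to-one matching, found here by brute
    # force over permutations of the reference tokens.
    best = min(
        sum(1 - cos(c, r) for c, r in zip(cand, perm))
        for perm in itertools.permutations(ref)
    )
    return best / len(cand)

# Hypothetical 2-D token embeddings for two short "sentences".
cand = [[1.0, 0.0], [0.8, 0.6]]
ref = [[1.0, 0.0], [0.8, 0.6]]

print(greedy_alignment(cand, ref))  # identical sentences: similarity ~ 1.0
print(wmd_like(cand, ref))          # identical sentences: distance ~ 0.0
```

Note the direction of the two scores: greedy alignment is a similarity (higher is better), while the WMD-style aggregate is a distance (lower is better). The second study in the paper asks how much this choice matters relative to preprocessing; preprocessing decisions (tokenization, punctuation, casing) change which token vectors enter `cand` and `ref` in the first place, which is why their effect can dominate the choice of aggregation.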