Paper title
Identifying Weaknesses in Machine Translation Metrics Through Minimum Bayes Risk Decoding: A Case Study for COMET
Paper authors
Abstract
Neural metrics have achieved impressive correlation with human judgements in the evaluation of machine translation systems, but before we can safely optimise towards such metrics, we should be aware of (and ideally eliminate) biases toward bad translations that receive high scores. Our experiments show that sample-based Minimum Bayes Risk decoding can be used to explore and quantify such weaknesses. When applying this strategy to COMET for en-de and de-en, we find that COMET models are not sensitive enough to discrepancies in numbers and named entities. We further show that these biases are hard to fully remove by simply training on additional synthetic data, and we release our code and data to facilitate further experiments.
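For readers unfamiliar with the technique, the following is a minimal sketch of sample-based Minimum Bayes Risk decoding, not the authors' released code. It assumes a pool of candidate translations already sampled from a translation model, and it picks the candidate with the highest average utility when every other sample is treated as a pseudo-reference. The function names `overlap_f1` and `mbr_decode` are hypothetical, and the token-overlap utility is only a toy stand-in: in the setting described in the abstract the utility would be a COMET model, which also conditions on the source sentence.

```python
from collections import Counter


def overlap_f1(hypothesis: str, reference: str) -> float:
    """Toy utility: token-level F1 between two strings.

    Assumption: this replaces a learned metric such as COMET purely
    for illustration; it ignores the source sentence entirely.
    """
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    common = sum((Counter(hyp) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(hyp), common / len(ref)
    return 2 * precision * recall / (precision + recall)


def mbr_decode(samples, utility=overlap_f1):
    """Return the sample with the highest average utility against the others."""
    if not samples:
        raise ValueError("need at least one sample")
    if len(samples) == 1:
        return samples[0]

    def expected_utility(i):
        # Every other sample serves as a pseudo-reference for candidate i.
        others = [s for j, s in enumerate(samples) if j != i]
        return sum(utility(samples[i], o) for o in others) / len(others)

    best = max(range(len(samples)), key=expected_utility)
    return samples[best]


if __name__ == "__main__":
    candidates = [
        "the cat sat on the mat",
        "the cat sat on a mat",
        "dogs ran away",
        "the cat sat on the mat today",
    ]
    print(mbr_decode(candidates))
```

Because the selected output is whatever the utility function prefers on average, inspecting those outputs is one way to surface systematic preferences of the metric itself, which is the use the abstract describes.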