Paper Title

Data Troubles in Sentence Level Confidence Estimation for Machine Translation

Paper Authors

Ciprian Chelba, Junpei Zhou, Yuezhang Li, Hideto Kazawa, Jeff Klingner, Mengmeng Niu

Paper Abstract

The paper investigates the feasibility of confidence estimation for neural machine translation models operating at the high end of the performance spectrum. As a by-product of the data annotation process necessary for building such models, we propose sentence-level accuracy $SACC$ as a simple, self-explanatory evaluation metric for translation quality. Experiments on two different annotator pools, one comprised of non-expert (crowd-sourced) translators and one of expert (professional) translators, show that $SACC$ can vary greatly depending on the translation proficiency of the annotators, despite the fact that both pools are about equally reliable according to Krippendorff's alpha metric; the relatively low values of inter-annotator agreement confirm the expectation that sentence-level binary labeling $good$ / $needs\ work$ for translation out of context is very hard. For an English-Spanish translation model operating at $SACC = 0.89$ according to a non-expert annotator pool, we can derive a confidence estimate that labels 0.5-0.6 of the $good$ translations in an "in-domain" test set with 0.95 precision. Switching to an expert annotator pool decreases $SACC$ dramatically: $0.61$ for English-Spanish, measured on the exact same data as above. This forces us to lower the CE model operating point to 0.9 precision, while correctly labeling about 0.20-0.25 of the $good$ translations in the data. We find surprising the extent to which CE depends on the proficiency level of the annotator pool used for labeling the data. This leads to an important recommendation we wish to make when tackling CE modeling in practice: it is critical to match the end-user expectation for translation quality in the desired domain with the demands placed on the annotators assigning binary quality labels to CE training data.
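A minimal sketch of the two quantities the abstract leans on, assuming $SACC$ is the fraction of translations annotated $good$ and that the CE operating point is chosen as the confidence threshold with the highest recall that still meets a target precision on the $good$ class. The helper names (`sacc`, `recall_at_precision`) and the toy data are hypothetical, not taken from the paper.

```python
import numpy as np

def sacc(labels):
    """Sentence-level accuracy: fraction of translations labeled `good`.

    labels: binary annotations per sentence, 1 = good, 0 = needs work.
    How multiple annotators are aggregated (e.g. majority vote) is an
    assumption here; the paper's abstract only defines SACC informally.
    """
    return float(np.asarray(labels).mean())

def recall_at_precision(scores, labels, target_precision=0.95):
    """Recall of `good` translations at the loosest confidence threshold
    whose precision on the `good` class still meets `target_precision`.

    scores: per-sentence CE confidence, higher = more confident `good`.
    labels: human binary labels, 1 = good, 0 = needs work.
    """
    order = np.argsort(-np.asarray(scores))          # sort by descending confidence
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                           # true positives at each cutoff
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / labels.sum()
    ok = precision >= target_precision               # cutoffs meeting the target
    return float(recall[ok].max()) if ok.any() else 0.0

# Toy illustration mirroring the abstract's numbers: a test set where
# 89% of translations are labeled good (SACC = 0.89), with CE scores
# that only partially separate the two classes.
rng = np.random.default_rng(0)
n = 1000
labels = (rng.random(n) < 0.89).astype(int)
scores = np.where(labels == 1,
                  rng.uniform(0.3, 1.0, n),          # `good` scores skew high
                  rng.uniform(0.0, 0.7, n))          # `needs work` scores skew low
print(sacc(labels))                                  # ~0.89
print(recall_at_precision(scores, labels, 0.95))     # fraction of good kept at 0.95 precision
```

The same `recall_at_precision` call with a lower target (e.g. 0.9) illustrates the abstract's second operating point: when expert labels shrink $SACC$ to 0.61, the precision target has to drop before any useful fraction of $good$ translations is recovered.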
