Paper Title

Separating Grains from the Chaff: Using Data Filtering to Improve Multilingual Translation for Low-Resourced African Languages

Authors

Abdulmumin, Idris; Beukman, Michael; Alabi, Jesujoba O.; Emezue, Chris; Asiko, Everlyn; Adewumi, Tosin; Muhammad, Shamsuddeen Hassan; Adeyemi, Mofetoluwa; Yousuf, Oreen; Singh, Sahib; Gwadabe, Tajuddeen Rabiu

Abstract

We participated in the WMT 2022 Large-Scale Machine Translation Evaluation for the African Languages Shared Task. This work describes our approach, which is based on filtering the given noisy data using a sentence-pair classifier that was built by fine-tuning a pre-trained language model. To train the classifier, we obtain positive samples (i.e. high-quality parallel sentences) from a gold-standard curated dataset and extract negative samples (i.e. low-quality parallel sentences) from automatically aligned parallel data by choosing sentences with low alignment scores. Our final machine translation model was then trained on filtered data, instead of the entire noisy dataset. We empirically validate our approach by evaluating on two common datasets and show that data filtering generally improves overall translation quality, in some cases even significantly.
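
The filtering pipeline in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `max_negative_score` and `min_prob` thresholds, and the `quality_prob` callable (a stand-in for the fine-tuned sentence-pair classifier) are all assumptions introduced here, since the abstract gives no concrete values or interfaces.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def build_classifier_examples(
    gold_pairs: Iterable[Pair],
    aligned_pairs: Iterable[Tuple[str, str, float]],
    max_negative_score: float,
) -> List[Tuple[str, str, int]]:
    """Build labelled training examples for a sentence-pair classifier.

    Gold-standard pairs are labelled 1 (high quality). Automatically
    aligned pairs whose alignment score falls below the hypothetical
    cutoff `max_negative_score` are labelled 0 (low quality).
    """
    examples = [(src, tgt, 1) for src, tgt in gold_pairs]
    examples += [
        (src, tgt, 0)
        for src, tgt, score in aligned_pairs
        if score < max_negative_score
    ]
    return examples


def filter_corpus(
    noisy_pairs: Iterable[Pair],
    quality_prob: Callable[[str, str], float],
    min_prob: float = 0.5,  # assumed threshold, not taken from the paper
) -> List[Pair]:
    """Keep only the pairs the classifier scores at or above `min_prob`.

    `quality_prob` stands in for the fine-tuned classifier and should
    return the probability that a pair is a good translation.
    """
    return [(s, t) for s, t in noisy_pairs if quality_prob(s, t) >= min_prob]
```

In this sketch, the translation model would then be trained on the output of `filter_corpus` rather than on the full noisy corpus. How the thresholds are chosen is left open here; the abstract does not say.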
