Paper Title
WMT 2020
Volctrans Parallel Corpus Filtering System for WMT 2020
Paper Authors
Abstract
In this paper, we describe our submissions to the WMT20 shared task on parallel corpus filtering and alignment for low-resource conditions. The task requires the participants to align potential parallel sentence pairs out of the given document pairs, and score them so that low-quality pairs can be filtered. Our system, Volctrans, is made of two modules, i.e., a mining module and a scoring module. Based on the word alignment model, the mining module adopts an iterative mining strategy to extract latent parallel sentences. In the scoring module, an XLM-based scorer provides scores, followed by reranking mechanisms and ensemble. Our submissions outperform the baseline by 3.x/2.x and 2.x/2.x for km-en and ps-en on From Scratch/Fine-Tune conditions, which is the highest among all submissions.