WMT 2020

Paper Title

Volctrans Parallel Corpus Filtering System for WMT 2020

Authors

Runxin Xu, Zhuo Zhi, Jun Cao, Mingxuan Wang, Lei Li

Abstract

In this paper, we describe our submissions to the WMT20 shared task on parallel corpus filtering and alignment for low-resource conditions. The task requires participants to align potential parallel sentence pairs from the given document pairs and to score them so that low-quality pairs can be filtered out. Our system, Volctrans, consists of two modules: a mining module and a scoring module. Based on a word alignment model, the mining module adopts an iterative mining strategy to extract latent parallel sentences. In the scoring module, an XLM-based scorer provides scores, followed by reranking mechanisms and ensembling. Our submissions outperform the baseline by 3.x/2.x and 2.x/2.x for km-en and ps-en on From Scratch/Fine-Tune conditions, which is the highest among all submissions.
