Paper Title
Subword Sampling for Low Resource Word Alignment
Paper Authors
Paper Abstract
Annotation projection is an important area in NLP that can greatly contribute to creating language resources for low-resource languages. Word alignment plays a key role in this setting. However, most existing word alignment methods are designed for the high-resource setting of machine translation, where millions of parallel sentences are available. This amount shrinks to a few thousand sentences for low-resource languages, where the established IBM models fail. In this paper, we propose subword sampling-based alignment of text units. The hypothesis behind this method is that aggregating alignments over different text granularities can help word-level alignment for certain language pairs. For languages with gold-standard alignments, we propose an iterative Bayesian optimization framework to optimize the selection of subwords from the space of possible subword representations of the source and target sentences. We show that the subword sampling method consistently outperforms word-level alignment on six language pairs: English-German, English-French, English-Romanian, English-Persian, English-Hindi, and English-Inuktitut. In addition, we show that the hyperparameters learned for certain language pairs can be applied to other languages without supervision and consistently improve the alignment results. We observe that using 5K parallel sentences together with our proposed subword sampling approach, we obtain F1 scores similar to those achieved with 100K parallel sentences by the existing word-level fast-align/eflomal alignment methods.
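The abstract describes two technical components: sampling multiple subword segmentations per sentence and aggregating the resulting alignment links back at the word level, plus an outer Bayesian optimization loop over segmentation hyperparameters. Below is a minimal, self-contained Python sketch of the first component. Everything in it (the `p_split` granularity parameter, the `##` continuation marker, the majority-vote aggregation) is an illustrative assumption rather than the paper's exact procedure; the subword-level links would in practice come from an aligner such as fast-align or eflomal run on the segmented corpora.

```python
import random
from collections import Counter

def sample_segmentation(word, p_split=0.3, rng=random):
    """Randomly split a word into subword pieces.

    Illustrative stand-in for drawing one segmentation from the space of
    possible subword representations (e.g., BPE merges at different
    granularities); larger p_split yields finer pieces.
    """
    pieces, start = [], 0
    for i in range(1, len(word)):
        if rng.random() < p_split:
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])
    # "##" marks word-internal pieces so links can be projected back to words.
    return [pieces[0]] + ["##" + p for p in pieces[1:]]

def subword_to_word_map(segmented):
    """Map each subword position to the index of its originating word."""
    mapping, w = [], -1
    for piece in segmented:
        if not piece.startswith("##"):
            w += 1
        mapping.append(w)
    return mapping

def aggregate_alignments(sampled_links, src_maps, tgt_maps, min_votes=2):
    """Project subword-level links (i, j) from each sampled segmentation to
    word-level links and keep those that recur across samples (voting)."""
    votes = Counter()
    for links, m_src, m_tgt in zip(sampled_links, src_maps, tgt_maps):
        votes.update({(m_src[i], m_tgt[j]) for i, j in links})
    return {link for link, count in votes.items() if count >= min_votes}

# Demo: several sampled segmentations of one source sentence.
rng = random.Random(0)
words = "we propose subword sampling".split()
for _ in range(3):
    print([p for w in words for p in sample_segmentation(w, 0.3, rng)])
```

For the outer loop, the following sketch uses scikit-optimize's `gp_minimize`, a library choice assumed purely for illustration (the paper does not prescribe it). The objective `alignment_f1_on_dev` is a hypothetical hook that should segment a development corpus at the candidate granularity, align it, and compute F1 against gold alignments; a toy surrogate stands in so the example runs end to end.

```python
# Assumed dependency: scikit-optimize (pip install scikit-optimize).
from skopt import gp_minimize
from skopt.space import Real

def alignment_f1_on_dev(p_split):
    # Hypothetical evaluation hook: segment a dev corpus at this
    # granularity, run the aligner, score against gold alignments.
    # A toy surrogate (peak near 0.3) keeps the sketch runnable.
    return 1.0 - (p_split - 0.3) ** 2

result = gp_minimize(
    lambda x: -alignment_f1_on_dev(x[0]),  # gp_minimize minimizes
    [Real(0.05, 0.95, name="p_split")],
    n_calls=15,
    random_state=0,
)
print("best p_split:", round(result.x[0], 3))
```

In the paper's setting, the hyperparameters found this way on language pairs with gold alignments are then transferred to other language pairs without supervision.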