Paper Title

Patching Leaks in the Charformer for Efficient Character-Level Generation

Authors

Lukas Edman, Antonio Toral, Gertjan van Noord

Abstract

Character-based representations have important advantages over subword-based ones for morphologically rich languages. They come with increased robustness to noisy input and do not need a separate tokenization step. However, they also have a crucial disadvantage: they notably increase the length of text sequences. The GBST method from Charformer groups (i.e., downsamples) characters to solve this, but allows information to leak when applied to a Transformer decoder. We solve this information leak issue, thereby enabling character grouping in the decoder. We show that Charformer downsampling has no apparent benefits in NMT over previous downsampling methods in terms of translation quality; however, it can be trained roughly 30% faster. Promising performance on English--Turkish translation indicates the potential of character-level models for morphologically rich languages.
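To make the leak the abstract refers to concrete, below is a minimal PyTorch sketch, not the authors' implementation. It assumes fixed-size blocks with simple mean pooling (the actual GBST method learns scores over several candidate block sizes), and the function names `block_downsample` and `causal_block_downsample` are hypothetical. It illustrates how naive block pooling in a decoder mixes future characters into a block's representation, and one possible way (shifting before pooling) to keep the pooling causal; the paper's actual fix may differ.

```python
import torch

def block_downsample(x: torch.Tensor, block_size: int) -> torch.Tensor:
    """Naive GBST-style grouping: mean-pool fixed character blocks.

    x: (seq_len, d_model) character embeddings; seq_len must be a
    multiple of block_size. In a decoder this leaks information:
    the pooled vector for a block also contains characters that
    appear *after* the position currently being predicted.
    """
    seq_len, d_model = x.shape
    return x.view(seq_len // block_size, block_size, d_model).mean(dim=1)

def causal_block_downsample(x: torch.Tensor, block_size: int) -> torch.Tensor:
    """Leak-free variant (illustrative): shift the characters one block
    to the right before pooling, so each block's representation only
    covers characters the decoder has already generated.
    """
    pad = x.new_zeros(block_size, x.shape[1])
    shifted = torch.cat([pad, x[:-block_size]], dim=0)
    return block_downsample(shifted, block_size)

if __name__ == "__main__":
    chars = torch.randn(12, 8)          # 12 characters, d_model = 8
    leaky = block_downsample(chars, 4)  # (3, 8): each row sees its whole block
    safe = causal_block_downsample(chars, 4)
    print(leaky.shape, safe.shape)      # torch.Size([3, 8]) torch.Size([3, 8])
```

The shifted variant trades one block of context for causality, which is why the downsampled decoder can still be trained with standard teacher forcing while avoiding access to unpredicted characters.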
