Paper Title
Impact of Tokenization on Language Models: An Analysis for Turkish
Paper Authors
Paper Abstract
Tokenization is an important text preprocessing step that prepares input tokens for deep language models. WordPiece and BPE are the de facto methods employed by prominent models such as BERT and GPT. However, the impact of tokenization can differ for morphologically rich languages, such as the Turkic languages, where many words can be generated by adding prefixes and suffixes. We compare five tokenizers at different granularity levels, i.e., their outputs range from individual characters to the surface forms of words, and include a Morphological-level tokenizer. We train these tokenizers and pretrain medium-sized language models with the RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus. We then fine-tune our models on six downstream tasks. Our experiments, supported by statistical tests, reveal that the Morphological-level tokenizer performs competitively with the de facto tokenizers. Furthermore, we find that increasing the vocabulary size improves the performance of the Morphological- and Word-level tokenizers more than that of the de facto tokenizers. For a reasonable trade-off between model size and performance, the ratio of vocabulary parameters to total model parameters can be chosen empirically as roughly 20% for the de facto tokenizers and 40% for the other tokenizers.
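To make the granularity spectrum concrete, the sketch below contrasts character-, subword-, morphological-, and word-level segmentations of a single Turkish word and shows how a de facto BPE tokenizer could be trained with the Hugging Face `tokenizers` library. This is a minimal illustration under assumed settings: the vocabulary size, special tokens, and the subword split shown are hypothetical and are not taken from the paper.

```python
# Minimal sketch (assumed setup, not the paper's configuration): training a
# de facto BPE tokenizer on raw Turkish text with the Hugging Face
# `tokenizers` library, and contrasting tokenizer granularity levels.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

def train_bpe(corpus_lines, vocab_size=32_000):
    """Train a byte-pair-encoding (BPE) tokenizer from an iterator of raw lines."""
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
    )
    tokenizer.train_from_iterator(corpus_lines, trainer)
    return tokenizer

# Illustrative granularity levels for the Turkish word "evlerinizden"
# ("from your houses"). The subword split is hypothetical, since actual
# BPE/WordPiece output depends on the trained vocabulary.
word = "evlerinizden"
granularities = {
    "Character-level": list(word),                                  # e, v, l, e, r, ...
    "Subword (BPE/WordPiece, hypothetical)": ["evler", "iniz", "den"],
    "Morphological-level": ["ev", "ler", "iniz", "den"],            # root + suffixes
    "Word-level": [word],                                           # surface form
}
for level, tokens in granularities.items():
    print(f"{level}: {tokens}")
```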
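The 20%/40% rule of thumb can be sanity-checked with back-of-the-envelope arithmetic: the token embedding matrix contributes vocab_size × hidden_size parameters, so its share of the model grows with the vocabulary. The sketch below estimates that share for a BERT/RoBERTa-style encoder at a few vocabulary sizes; the hidden size, layer count, and vocabulary sizes are illustrative assumptions, not the paper's reported configuration.

```python
# Back-of-the-envelope sketch of the vocabulary-to-total parameter ratio
# discussed in the abstract. All dimensions below are illustrative
# assumptions, not the paper's reported model sizes.

def transformer_params(vocab_size, hidden=768, layers=12, ffn=4 * 768):
    """Rough parameter count for a BERT/RoBERTa-style encoder (biases ignored)."""
    embedding = vocab_size * hidden                      # token embedding matrix
    per_layer = 4 * hidden * hidden + 2 * hidden * ffn   # attention + feed-forward
    return embedding, embedding + layers * per_layer

for vocab in (16_000, 32_000, 64_000):
    emb, total = transformer_params(vocab)
    print(f"vocab={vocab:>6}: vocabulary params are {emb / total:.0%} of the model")
```

Under these assumed dimensions, a 32k vocabulary puts the embedding matrix at roughly a fifth of the parameters and a 64k vocabulary at closer to two fifths, which is the kind of trade-off the abstract's 20% and 40% guidelines describe.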