Paper Title
A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models
Paper Authors
Paper Abstract
Recent work on tokenizer-free multilingual pretrained models shows promising results in improving cross-lingual transfer and reducing engineering overhead (Clark et al., 2022; Xue et al., 2022). However, these works mainly focus on reporting accuracy on a limited set of tasks and data settings, placing less emphasis on other factors that matter when tuning and deploying the models in practice, such as memory usage, inference speed, and fine-tuning data robustness. We attempt to fill this gap by performing a comprehensive empirical comparison of multilingual tokenizer-free and subword-based models along these various dimensions. Surprisingly, we find that subword-based models might still be the most practical choice in many settings, achieving better performance at lower inference latency and memory usage. Based on these results, we encourage future work on tokenizer-free methods to consider these factors when designing and evaluating new models.
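The latency and memory gap stems largely from sequence length: tokenizer-free models operate on bytes or characters, producing inputs several times longer than subword sequences, and Transformer compute and attention memory grow with sequence length. Below is a minimal sketch, assuming the Hugging Face `transformers` library and the public `google/byt5-small` (ByT5, the byte-level model of Xue et al., 2022) and `google/mt5-small` (a SentencePiece subword baseline) checkpoints, that makes this length difference concrete; it is an illustration of the general phenomenon, not a reproduction of the paper's measurements.

```python
# Compare input sequence lengths for a byte-level (tokenizer-free)
# model vs. a subword-based model on the same text.
from transformers import AutoTokenizer

byte_tok = AutoTokenizer.from_pretrained("google/byt5-small")    # operates on raw bytes
subword_tok = AutoTokenizer.from_pretrained("google/mt5-small")  # SentencePiece subwords

text = "Tokenizer-free models operate directly on bytes or characters."

byte_len = len(byte_tok(text)["input_ids"])
subword_len = len(subword_tok(text)["input_ids"])

print(f"byte-level sequence length: {byte_len}")
print(f"subword sequence length:    {subword_len}")
# The byte-level sequence is typically several times longer than the
# subword one, which means more compute per Transformer layer and a
# larger attention memory footprint at inference time.
```

For non-Latin scripts the gap widens further, since a single character can occupy multiple UTF-8 bytes while still mapping to one or a few subword tokens.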