Paper Title

On the Economics of Multilingual Few-shot Learning: Modeling the Cost-Performance Trade-offs of Machine Translated and Manual Data

Authors

Kabir Ahuja, Monojit Choudhury, Sandipan Dandapat

Abstract

Borrowing the idea of production functions from microeconomics, we introduce a framework to systematically evaluate the performance and cost trade-offs between machine-translated and manually created labelled data for task-specific fine-tuning of massively multilingual language models. We illustrate the effectiveness of our framework through a case study on the TyDIQA-GoldP dataset. One interesting conclusion of the study is that if the cost of machine translation is greater than zero, the optimal performance at least cost is always achieved with at least some, or only, manually created data. To our knowledge, this is the first attempt to extend the concept of production functions to the study of data collection strategies for training multilingual models, and the framework can serve as a valuable tool for other similar cost versus data trade-offs in NLP.
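To make the least-cost idea in the abstract concrete, here is a minimal sketch in Python. It is not the authors' model: the functional form of `performance`, its constants (0.4, 0.8, 2000), the function name `least_cost_mix`, and the per-example costs are all hypothetical, chosen only to illustrate how one might search for the cheapest mix of machine-translated and manually created examples that reaches a target score.

```python
import math


def performance(n_translated, n_manual):
    """Hypothetical production function (an assumption, not from the paper).

    Translated examples are treated as worth 0.4 of a manual example,
    and the score saturates at an assumed ceiling of 0.8.
    """
    effective = 0.4 * n_translated + 1.0 * n_manual
    return 0.8 * (1.0 - math.exp(-effective / 2000.0))


def least_cost_mix(target, cost_translated, cost_manual, max_n=10000, step=100):
    """Grid-search the cheapest (cost, n_translated, n_manual) reaching target.

    Returns None if no point on the grid reaches the target.
    """
    best = None
    for t in range(0, max_n + 1, step):
        for m in range(0, max_n + 1, step):
            if performance(t, m) >= target:
                cost = cost_translated * t + cost_manual * m
                if best is None or cost < best[0]:
                    best = (cost, t, m)
                # For this t, any larger m only adds cost, so stop here.
                break
    return best
```

Calling `least_cost_mix(0.5, 0.1, 1.0)` searches the grid for a target score of 0.5 with translated examples at a tenth of the cost of manual ones. Because the toy function is not fitted to any dataset, its output says nothing about the paper's findings; a fitted production function would replace `performance`.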
