Paper Title
Multilingual CheckList: Generation and Evaluation
Paper Authors
Paper Abstract
Multilingual evaluation benchmarks usually contain a limited set of high-resource languages and do not test models for specific linguistic capabilities. CheckList is a template-based evaluation approach that tests models for specific capabilities. The CheckList template creation process requires native speakers, posing a challenge in scaling to hundreds of languages. In this work, we explore multiple approaches to generating Multilingual CheckLists. We devise an algorithm, the Template Extraction Algorithm (TEA), for automatically extracting target-language CheckList templates from machine-translated instances of source-language templates. We compare TEA CheckLists with CheckLists created with different levels of human intervention. We further introduce metrics along the dimensions of cost, diversity, utility, and correctness to compare the CheckLists. We thoroughly analyze the different approaches to creating CheckLists in Hindi, and additionally experiment with nine more languages. We find that TEA followed by human verification is ideal for scaling CheckList-based evaluation to multiple languages, while TEA alone gives a good estimate of model performance.