Paper Title
CLASP: Few-Shot Cross-Lingual Data Augmentation for Semantic Parsing
Paper Authors
Paper Abstract
A bottleneck to developing Semantic Parsing (SP) models is the need for a large volume of human-labeled training data. Given the complexity and cost of human annotation for SP, labeled data is often scarce, particularly in multilingual settings. Large Language Models (LLMs) excel at SP given only a few examples; however, LLMs are unsuitable for runtime systems that require low latency. In this work, we propose CLASP, a simple method to improve low-resource SP for moderate-sized models: we generate synthetic data from AlexaTM 20B to augment the training set for a model 40x smaller (500M parameters). We evaluate on two datasets in low-resource settings: English PIZZA, containing either 348 or 16 real examples, and mTOP cross-lingual zero-shot, where training data is available only in English and the model must generalize to four new languages. On both datasets, we show significant improvements over strong baseline methods.
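The augmentation loop the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `call_llm` is a hypothetical stub standing in for AlexaTM 20B, and the `utterance => parse` prompt format and the PIZZA-style parse strings are assumptions for demonstration only.

```python
def build_prompt(seed_pairs):
    """Format a few labeled (utterance, parse) examples as a few-shot prompt,
    one 'utterance => parse' pair per line, for the LLM to continue."""
    lines = [f"{u} => {p}" for u, p in seed_pairs]
    lines.append("")  # trailing newline: the LLM continues with a new pair
    return "\n".join(lines)

def call_llm(prompt):
    # Hypothetical stub standing in for AlexaTM 20B text generation;
    # a real system would sample a continuation of the few-shot prompt.
    return ("two small pizzas with ham => "
            "(ORDER (PIZZA (NUMBER 2 ) (SIZE SMALL ) (TOPPING HAM ) ) )")

def parse_completion(text):
    """Split generated 'utterance => parse' lines back into labeled pairs,
    discarding malformed lines."""
    pairs = []
    for line in text.strip().splitlines():
        if " => " in line:
            u, p = line.split(" => ", 1)
            pairs.append((u.strip(), p.strip()))
    return pairs

def augment(real_pairs, rounds=1):
    """Grow a small labeled set with LLM-generated synthetic pairs.
    The combined data would then train the smaller (500M) parser."""
    synthetic = []
    for _ in range(rounds):
        synthetic.extend(parse_completion(call_llm(build_prompt(real_pairs))))
    return real_pairs + synthetic
```

For the cross-lingual setting, the same loop would prompt the LLM to produce utterances in the target language while keeping the parse labels, so the smaller model sees target-language training data it otherwise lacks.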