Paper Title
SLING: Sino Linguistic Evaluation of Large Language Models
Paper Authors
Paper Abstract
To understand what kinds of linguistic knowledge are encoded by pretrained Chinese language models (LMs), we introduce the benchmark of Sino LINGuistics (SLING), which consists of 38K minimal sentence pairs in Mandarin Chinese grouped into 9 high-level linguistic phenomena. Each pair demonstrates the acceptability contrast of a specific syntactic or semantic phenomenon (e.g., The keys are lost vs. The keys is lost), and an LM should assign lower perplexity to the acceptable sentence. In contrast to the CLiMP dataset (Xiang et al., 2021), which also contains Chinese minimal pairs and was created by translating the vocabulary of the English BLiMP dataset, the minimal pairs in SLING are derived primarily by applying syntactic and lexical transformations to naturally-occurring, linguist-annotated sentences from the Chinese Treebank 9.0, thus addressing severe issues in CLiMP's data generation process. We test 18 publicly available pretrained monolingual (e.g., BERT-base-zh, CPM) and multi-lingual (e.g., mT5, XLM) language models on SLING. Our experiments show that the average accuracy for LMs is far below human performance (69.7% vs. 97.1%), while BERT-base-zh achieves the highest accuracy (84.8%) of all tested LMs, even much larger ones. Additionally, we find that most LMs have a strong gender and number (singular/plural) bias, and they perform better on local phenomena than hierarchical ones.
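The evaluation protocol described above compares an LM's perplexity on the two sentences of a minimal pair. The following is a minimal sketch (not the authors' evaluation code) of how such a comparison could be done with the Hugging Face transformers library and a causal Chinese LM; the model name and the example pair are illustrative assumptions, not taken from SLING or from the 18 models tested in the paper. Note that masked LMs such as BERT-base-zh would instead be scored with a pseudo-log-likelihood, which this sketch does not cover.

```python
# Minimal sketch: score one acceptability minimal pair with a causal Chinese LM.
# The LM should assign lower perplexity to the acceptable sentence.
# Assumes the Hugging Face `transformers` library; the model below is an
# illustrative choice, not one of the models evaluated in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "uer/gpt2-chinese-cluecorpussmall"  # example model, not from the paper
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sentence_perplexity(sentence: str) -> float:
    """Return the causal-LM perplexity of a single sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean token cross-entropy.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

# Illustrative minimal pair (classifier-noun mismatch), not drawn from SLING itself.
acceptable = "他买了三本书。"    # "He bought three books." (correct classifier 本)
unacceptable = "他买了三条书。"  # same sentence with an incompatible classifier 条

prefers_acceptable = sentence_perplexity(acceptable) < sentence_perplexity(unacceptable)
print("LM prefers the acceptable sentence:", prefers_acceptable)
```

Accuracy on the benchmark would then be the fraction of pairs for which the comparison above comes out in favor of the acceptable sentence.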