Paper Title
Measuring Reliability of Large Language Models through Semantic Consistency
Paper Authors
Paper Abstract
While large pretrained language models (PLMs) demonstrate incredible fluency and performance on many natural language tasks, recent work has shown that well-performing PLMs are very sensitive to the prompts that are fed to them. Even when prompts are semantically identical, language models may give very different answers. When considering safe and trustworthy deployments of PLMs, we would like their outputs to be consistent under prompts that mean the same thing or convey the same intent. While some work has looked into how state-of-the-art PLMs address this need, it has been limited to evaluating lexical equality of single- or multi-word answers and does not address the consistency of generative text sequences. In order to understand the consistency of PLMs under text generation settings, we develop a measure of semantic consistency that allows the comparison of open-ended text outputs. We implement several versions of this consistency metric to evaluate the performance of a number of PLMs on paraphrased versions of questions in the TruthfulQA dataset. We find that our proposed metrics are considerably more consistent than traditional metrics embodying lexical consistency, and also correlate more strongly with human evaluation of output consistency.
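The abstract does not spell out how the semantic consistency metric is computed, so the following is only a minimal sketch of one plausible embedding-based variant: score the agreement among a model's answers to paraphrased versions of the same question as the mean pairwise cosine similarity of their sentence embeddings. The use of `sentence_transformers`, the "all-MiniLM-L6-v2" model, and the `semantic_consistency` helper are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of an embedding-based semantic consistency score.
# Assumes the sentence-transformers package; the paper's metric may differ.
from itertools import combinations

from sentence_transformers import SentenceTransformer, util

_embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding backend


def semantic_consistency(outputs: list[str]) -> float:
    """Mean pairwise cosine similarity between a model's answers to
    paraphrases of the same question (closer to 1.0 = more consistent)."""
    if len(outputs) < 2:
        return 1.0
    embeddings = _embedder.encode(outputs, convert_to_tensor=True)
    sims = [
        util.cos_sim(embeddings[i], embeddings[j]).item()
        for i, j in combinations(range(len(outputs)), 2)
    ]
    return sum(sims) / len(sims)


# Example: answers a PLM might produce for three paraphrases of one
# TruthfulQA-style question (illustrative strings, not real model outputs).
answers = [
    "No, cracking your knuckles does not cause arthritis.",
    "Knuckle cracking has not been shown to cause arthritis.",
    "Yes, it causes arthritis in your hands over time.",
]
print(f"semantic consistency: {semantic_consistency(answers):.3f}")
```

A lexical metric (e.g., exact string match) would judge all three answers inconsistent, whereas an embedding-based score distinguishes the two semantically equivalent answers from the contradictory third one, which is the kind of gap the paper's proposed metrics are meant to capture.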