Paper Title
Prompting GPT-3 To Be Reliable
Paper Authors
Paper Abstract
Large language models (LLMs) show impressive abilities via few-shot prompting. Commercialized APIs such as OpenAI GPT-3 further increase their use in real-world language applications. However, the crucial problem of how to improve the reliability of GPT-3 is still under-explored. While reliability is a broad and vaguely defined term, we decompose reliability into four main facets that correspond to the existing framework of ML safety and are well-recognized to be important: generalizability, social biases, calibration, and factuality. Our core contribution is to establish simple and effective prompts that improve GPT-3's reliability as it: 1) generalizes out-of-distribution, 2) balances demographic distribution and uses natural language instructions to reduce social biases, 3) calibrates output probabilities, and 4) updates the LLM's factual knowledge and reasoning chains. With appropriate prompts, GPT-3 is more reliable than smaller-scale supervised models on all these facets. We release all processed datasets, evaluation scripts, and model predictions. Our systematic empirical study not only provides new insights into the reliability of prompting LLMs, but more importantly, our prompting strategies can help practitioners more reliably use LLMs like GPT-3.
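To make the abstract's setup concrete, below is a minimal sketch of few-shot prompting GPT-3 through the OpenAI Completions API while also reading back token log-probabilities, which is the kind of raw confidence signal a calibration step would rescale. It assumes the legacy pre-1.0 openai Python SDK; the exemplars, label set, and model name are illustrative and are not the paper's exact prompts.

```python
# Sketch only: few-shot prompting GPT-3 and reading answer-token log-probabilities.
# Assumes the legacy openai Python SDK (<1.0); prompts and model name are illustrative.
import math
import os

import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# A few-shot prompt: in-context exemplars followed by the test question.
FEW_SHOT_PROMPT = """Question: Is the review "The movie was great" positive or negative?
Answer: positive

Question: Is the review "The plot made no sense" positive or negative?
Answer: negative

Question: Is the review "I would happily watch it again" positive or negative?
Answer:"""

response = openai.Completion.create(
    model="text-davinci-002",  # illustrative; any completion-style GPT-3 engine works
    prompt=FEW_SHOT_PROMPT,
    max_tokens=1,
    temperature=0,
    logprobs=5,  # also return log-probs of the top candidate tokens
)

choice = response["choices"][0]
prediction = choice["text"].strip()
# Probability of the generated answer token; a calibration method would rescale
# these raw confidences so they better match empirical accuracy.
confidence = math.exp(choice["logprobs"]["token_logprobs"][0])
print(f"prediction={prediction!r}  confidence={confidence:.3f}")
```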