Paper Title

SafeText: A Benchmark for Exploring Physical Safety in Language Models

Paper Authors

Sharon Levy, Emily Allaway, Melanie Subbiah, Lydia Chilton, Desmond Patton, Kathleen McKeown, William Yang Wang

Paper Abstract

Understanding what constitutes safe text is an important issue in natural language processing and can often prevent the deployment of models deemed harmful and unsafe. One such type of safety that has been scarcely studied is commonsense physical safety, i.e. text that is not explicitly violent and requires additional commonsense knowledge to comprehend that it leads to physical harm. We create the first benchmark dataset, SafeText, comprising real-life scenarios with paired safe and physically unsafe pieces of advice. We utilize SafeText to empirically study commonsense physical safety across various models designed for text generation and commonsense reasoning tasks. We find that state-of-the-art large language models are susceptible to the generation of unsafe text and have difficulty rejecting unsafe advice. As a result, we argue for further studies of safety and the assessment of commonsense physical safety in models before release.
