Paper Title
When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment
Paper Authors
Paper Abstract
AI systems are becoming increasingly intertwined with human life. In order to effectively collaborate with humans and ensure safety, AI systems need to be able to understand, interpret, and predict human moral judgments and decisions. Human moral judgments are often guided by rules, but not always. A central challenge for AI safety is capturing the flexibility of the human moral mind -- the ability to determine when a rule should be broken, especially in novel or unusual situations. In this paper, we present a novel challenge set consisting of rule-breaking question answering (RBQA) of cases that involve potentially permissible rule-breaking -- inspired by recent moral psychology studies. Using a state-of-the-art large language model (LLM) as a basis, we propose a novel moral chain of thought (MORALCOT) prompting strategy that combines the strengths of LLMs with theories of moral reasoning developed in cognitive science to predict human moral judgments. MORALCOT outperforms seven existing LLMs by 6.2% F1, suggesting that modeling human reasoning might be necessary to capture the flexibility of the human moral mind. We also conduct a detailed error analysis to suggest directions for future work to improve AI safety using RBQA. Our data is open-sourced at https://huggingface.co/datasets/feradauto/MoralExceptQA and our code at https://github.com/feradauto/MoralCoT.
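As a rough illustration of the setup described above, the sketch below shows how one might load the MoralExceptQA challenge set from the Hugging Face URL given in the abstract and build a chain-of-thought style prompt that asks an LLM to reason about a rule before judging whether breaking it is permissible. This is a minimal sketch, not the authors' exact MoralCoT pipeline: the sub-questions, the `query_llm` helper, and the dataset split and field names are illustrative assumptions.

```python
# Minimal sketch of a MoralCoT-style prompting setup (not the paper's exact pipeline).
from datasets import load_dataset


def build_moral_cot_prompt(scenario: str) -> str:
    """Compose a step-by-step moral-reasoning prompt for a rule-breaking scenario.

    The intermediate questions here are illustrative, not the paper's verbatim prompts.
    """
    return (
        f"Scenario: {scenario}\n"
        "Let's think step by step.\n"
        "1. What rule is at stake, and what is its purpose?\n"
        "2. Who would be affected if the rule were broken in this case, and how?\n"
        "3. Weighing the rule's purpose against the consequences, "
        "is it OK to break the rule here? Answer yes or no."
    )


def query_llm(prompt: str) -> str:
    """Placeholder for a call to an instruction-following LLM API of your choice."""
    raise NotImplementedError("plug in your preferred LLM client here")


if __name__ == "__main__":
    # Dataset name taken from the paper's URL; the split and config names may differ.
    dataset = load_dataset("feradauto/MoralExceptQA", split="test")
    example = dataset[0]
    prompt = build_moral_cot_prompt(example["scenario"])  # field name assumed
    print(prompt)
    # prediction = query_llm(prompt)  # compare against the human judgment in the dataset
```

In this framing, the model's final yes/no answer can be scored against the aggregated human judgments in the challenge set (e.g., with F1, as the abstract reports), while the intermediate answers expose the reasoning steps for error analysis.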