Paper Title

Aligning AI With Shared Human Values

Paper Authors

Hendrycks, Dan; Burns, Collin; Basart, Steven; Critch, Andrew; Li, Jerry; Song, Dawn; Steinhardt, Jacob

Paper Abstract

We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
