Paper Title

AGI Agent Safety by Iteratively Improving the Utility Function

Paper Authors

Holtman, Koen

Paper Abstract

While it is still unclear if agents with Artificial General Intelligence (AGI) could ever be built, we can already use mathematical models to investigate potential safety systems for these agents. We present an AGI safety layer that creates a special dedicated input terminal to support the iterative improvement of an AGI agent's utility function. The humans who switched on the agent can use this terminal to close any loopholes that are discovered in the utility function's encoding of agent goals and constraints, to direct the agent towards new goals, or to force the agent to switch itself off. An AGI agent may develop the emergent incentive to manipulate the above utility function improvement process, for example by deceiving, restraining, or even attacking the humans involved. The safety layer will partially, and sometimes fully, suppress this dangerous incentive. The first part of this paper generalizes earlier work on AGI emergency stop buttons. We aim to make the mathematical methods used to construct the layer more accessible, by applying them to an MDP model. We discuss two provable properties of the safety layer, and show ongoing work in mapping it to a Causal Influence Diagram (CID). In the second part, we develop full mathematical proofs, and show that the safety layer creates a type of bureaucratic blindness. We then present the design of a learning agent, a design that wraps the safety layer around either a known machine learning system, or a potential future AGI-level learning system. The resulting agent will satisfy the provable safety properties from the moment it is first switched on. Finally, we show how this agent can be mapped from its model to a real-life implementation. We review the methodological issues involved in this step, and discuss how these are typically resolved.
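To make the role of the dedicated input terminal more concrete, the sketch below shows a toy agent that always reads its utility function from such a terminal, so its goals change whenever the operators push an update or direct it to switch itself off. The class names (`InputTerminal`, `GreedyAgent`) and the toy utilities are assumptions made for illustration only; this is not the paper's MDP construction, and it contains none of the safety-layer machinery that suppresses the agent's incentive to manipulate the update process.

```python
# Minimal, hypothetical sketch (not the paper's construction): an agent that
# reads its utility function from a dedicated input terminal, so operators
# can iteratively correct its goals or force a shutdown.

from typing import Callable, List

State = str
Action = str
Utility = Callable[[State, Action], float]


class InputTerminal:
    """Dedicated channel through which operators submit utility updates."""

    def __init__(self, initial_utility: Utility) -> None:
        self._utility = initial_utility

    def submit_update(self, new_utility: Utility) -> None:
        # The paper's safety layer concerns removing the agent's emergent
        # incentive to manipulate exactly this update step.
        self._utility = new_utility

    def current_utility(self) -> Utility:
        return self._utility


class GreedyAgent:
    """One-step greedy agent; its behaviour changes whenever the terminal does."""

    def __init__(self, terminal: InputTerminal, actions: List[Action]) -> None:
        self.terminal = terminal
        self.actions = actions

    def act(self, state: State) -> Action:
        utility = self.terminal.current_utility()
        return max(self.actions, key=lambda a: utility(state, a))


if __name__ == "__main__":
    # Initial (possibly flawed) encoding of the goal: always reward "collect".
    terminal = InputTerminal(lambda s, a: 1.0 if a == "collect" else 0.0)
    agent = GreedyAgent(terminal, ["collect", "wait", "shutdown"])
    print(agent.act("factory"))  # -> collect

    # A loophole is discovered; operators push a corrected utility function
    # that directs the agent to switch itself off in this state.
    terminal.submit_update(lambda s, a: 1.0 if a == "shutdown" else 0.0)
    print(agent.act("factory"))  # -> shutdown
```

In this toy setup the agent simply obeys whatever utility function the terminal holds; the substance of the paper is in proving that a suitably constructed agent will not try to deceive, restrain, or attack the humans operating that terminal.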
