Paper title

Complex behavior from intrinsic motivation to occupy action-state path space

Authors

Jorge Ramírez-Ruiz, Dmytro Grytskyy, Chiara Mastrogiuseppe, Yamen Habib, Rubén Moreno-Bote

Abstract

Most theories of behavior posit that agents tend to maximize some form of reward or utility. However, animals very often move with curiosity and seem to be motivated in a reward-free manner. Here we abandon the idea of reward maximization, and propose that the goal of behavior is maximizing occupancy of future paths of actions and states. According to this maximum occupancy principle, rewards are the means to occupy path space, not the goal per se; goal-directedness simply emerges as rational ways of searching for resources so that movement, understood amply, never ends. We find that action-state path entropy is the only measure consistent with additivity and other intuitive properties of expected future action-state path occupancy. We provide analytical expressions that relate the optimal policy and state-value function, and prove convergence of our value iteration algorithm. Using discrete and continuous state tasks, including a high-dimensional controller, we show that complex behaviors such as 'dancing', hide-and-seek and a basic form of altruistic behavior naturally result from the intrinsic motivation to occupy path space. All in all, we present a theory of behavior that generates both variability and goal-directedness in the absence of reward maximization.
