Paper Title

Multiagent Value Iteration Algorithms in Dynamic Programming and Reinforcement Learning

Authors

Bertsekas, Dimitri

Abstract

We consider infinite horizon dynamic programming problems, where the control at each stage consists of several distinct decisions, each one made by one of several agents. In an earlier work we introduced a policy iteration algorithm, where the policy improvement is done one-agent-at-a-time in a given order, with knowledge of the choices of the preceding agents in the order. As a result, the amount of computation for each policy improvement grows linearly with the number of agents, as opposed to exponentially for the standard all-agents-at-once method. For the case of a finite-state discounted problem, we showed convergence to an agent-by-agent optimal policy. In this paper, this result is extended to value iteration and optimistic versions of policy iteration, as well as to more general DP problems where the Bellman operator is a contraction mapping, such as stochastic shortest path problems with all policies being proper.
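To make the agent-by-agent idea concrete, here is a minimal sketch of one way such a value iteration could look for a two-agent, finite-state discounted problem. The function name, the array layout (`P[u1, u2, i, j]` for transition probabilities, `g[u1, u2, i]` for stage costs), and the fixed two-agent update order are illustrative assumptions for this sketch, not the paper's implementation.

```python
import numpy as np

def agent_by_agent_vi(P, g, gamma, num_iters=100):
    """Sketch of agent-by-agent value iteration (two agents, finite states).

    Hypothetical inputs, not from the paper:
      P[u1, u2, i, j] : transition probability from state i to j under (u1, u2)
      g[u1, u2, i]    : expected stage cost at state i under joint control (u1, u2)
      gamma           : discount factor in (0, 1)
    """
    n_u1, n_u2, n_states = g.shape
    J = np.zeros(n_states)              # current value estimate
    u1 = np.zeros(n_states, dtype=int)  # agent 1's current decisions
    u2 = np.zeros(n_states, dtype=int)  # agent 2's current decisions
    for _ in range(num_iters):
        for i in range(n_states):
            # Agent 1 improves its decision with agent 2's choice held fixed...
            q1 = [g[a, u2[i], i] + gamma * P[a, u2[i], i] @ J for a in range(n_u1)]
            u1[i] = int(np.argmin(q1))
            # ...then agent 2 improves, knowing agent 1's updated choice.
            q2 = [g[u1[i], b, i] + gamma * P[u1[i], b, i] @ J for b in range(n_u2)]
            u2[i] = int(np.argmin(q2))
            # Value update at state i under the agent-by-agent minimization.
            J[i] = q2[u2[i]]
    return J, u1, u2
```

The point of the sketch is the cost structure the abstract describes: each per-state update evaluates |U1| + |U2| candidate controls (linear in the number of agents) rather than the |U1| × |U2| joint controls that an all-agents-at-once minimization would require.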
