Paper title

Upper Bounds for All and Max-gain Policy Iteration Algorithms on Deterministic MDPs

Authors

Goenka, Ritesh; Gupta, Eashan; Khyalia, Sushil; Agarwal, Pratyush; Wajid, Mulinti Shaik; Kalyanakrishnan, Shivaram

Abstract

Policy Iteration (PI) is a widely used family of algorithms to compute optimal policies for Markov Decision Problems (MDPs). We derive upper bounds on the running time of PI on Deterministic MDPs (DMDPs): the class of MDPs in which every state-action pair has a unique next state. Our results include a non-trivial upper bound that applies to the entire family of PI algorithms; another to all "max-gain" switching variants; and affirmation that a conjecture regarding Howard's PI on MDPs is true for DMDPs. Our analysis is based on certain graph-theoretic results, which may be of independent interest.
