自信的近似政策迭代，以在$ q^π$ - 真实的MDP中进行有效的本地计划

论文标题

自信的近似政策迭代，以在$ q^π$ - 真实的MDP中进行有效的本地计划

Confident Approximate Policy Iteration for Efficient Local Planning in $q^π$-realizable MDPs

论文作者

Weisz, Gellért, György, András, Kozuno, Tadashi, Szepesvári, Csaba

论文摘要

我们考虑在$γ$删除的马尔可夫决策过程中近似动态编程，并将其应用于具有线性价值函数近似的近似计划。我们的第一个贡献是一种近似政策迭代（API）的新变体，称为自信近似策略迭代（CAPI），该变体计算确定性的固定策略，其最佳误差与有效的地平线$ H $的乘积线性限制，并且最差的案例近似近似错误$ε$ $ε$ $ε$ $ε$。对API的这种改善（其错误尺度$ H^2 $）以$ H $倍的内存成本的价格提高。与Scherrer和Lesner [2012]不同，他们建议计算非平稳政策以实现类似的改进（具有相同的内存开销），我们能够坚持固定的策略。这允许我们的第二个贡献，即CAPI在本地访问模拟器和$ d $维线性函数近似中的应用。因此，我们设计了一种应用CAPI的计划算法，以在动态发展的状态集上获得一系列具有精确精度的策略序列。该算法输出了$ \ tilde o（\ sqrt {d}hε）$ - 发出$ \ tilde o（dh^4/ε^2）$在对模拟器上查询，同时又实现了最佳准确性，并且最出色的Query复杂性绑定了，同时仅在Algorith上遇到了一个最佳的Query复杂性，同时又实现了一个在Algorith中的最佳准确性。除$ h $以外的所有参数中，该查询复杂性在所有参数中都很紧。这些改进是指算法及其输出策略的记忆和计算成本的轻度（多项式）增加。

We consider approximate dynamic programming in $γ$-discounted Markov decision processes and apply it to approximate planning with linear value-function approximation. Our first contribution is a new variant of Approximate Policy Iteration (API), called Confident Approximate Policy Iteration (CAPI), which computes a deterministic stationary policy with an optimal error bound scaling linearly with the product of the effective horizon $H$ and the worst-case approximation error $ε$ of the action-value functions of stationary policies. This improvement over API (whose error scales with $H^2$) comes at the price of an $H$-fold increase in memory cost. Unlike Scherrer and Lesner [2012], who recommended computing a non-stationary policy to achieve a similar improvement (with the same memory overhead), we are able to stick to stationary policies. This allows for our second contribution, the application of CAPI to planning with local access to a simulator and $d$-dimensional linear function approximation. As such, we design a planning algorithm that applies CAPI to obtain a sequence of policies with successively refined accuracies on a dynamically evolving set of states. The algorithm outputs an $\tilde O(\sqrt{d}Hε)$-optimal policy after issuing $\tilde O(dH^4/ε^2)$ queries to the simulator, simultaneously achieving the optimal accuracy bound and the best known query complexity bound, while earlier algorithms in the literature achieve only one of them. This query complexity is shown to be tight in all parameters except $H$. These improvements come at the expense of a mild (polynomial) increase in memory and computational costs of both the algorithm and its output policy.

下载PDF全文

下载文献需遵守相关版权规定

论文标题