Paper Title
Adaptive Approximate Policy Iteration
Paper Authors
Paper Abstract
Model-free reinforcement learning algorithms combined with value function approximation have recently achieved impressive performance in a variety of application domains. However, the theoretical understanding of such algorithms is limited, and existing results are largely focused on episodic or discounted Markov decision processes (MDPs). In this work, we present adaptive approximate policy iteration (AAPI), a learning scheme that enjoys a $\tilde{O}(T^{2/3})$ regret bound for undiscounted, continuing learning in uniformly ergodic MDPs. This is an improvement over the best existing bound of $\tilde{O}(T^{3/4})$ for the average-reward case with function approximation. Our algorithm and analysis rely on online learning techniques, where value functions are treated as losses. The main technical novelty is the use of a data-dependent adaptive learning rate coupled with a so-called optimistic prediction of upcoming losses. In addition to theoretical guarantees, we demonstrate the advantages of our approach empirically on several environments.
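To make the "value functions as losses" idea concrete, below is a minimal, hypothetical Python sketch (not the authors' exact procedure) of an optimistic exponential-weights policy update at a single state: estimated action values play the role of the online-learning losses, the learning rate adapts to the observed variation in those estimates, and the most recent estimate serves as the optimistic prediction of the upcoming loss. The function name, the specific adaptive rate, and the constant c are illustrative assumptions.

import numpy as np

def optimistic_weights_update(q_history, c=1.0):
    """Illustrative optimistic exponential-weights policy for one state.

    q_history : list of estimated action-value vectors Q_1..Q_t (each of shape (A,)),
                playing the role of losses in the online-learning view.
    c         : scale for the data-dependent learning rate (an assumption here).
    Returns the next policy over the A actions.
    """
    q_t = q_history[-1]
    # Optimistic prediction: guess that the next value estimate resembles the last one.
    m_next = q_t
    # Data-dependent adaptive learning rate: shrinks with the cumulative squared change
    # between consecutive value estimates (one common choice, not necessarily the paper's).
    variation = sum(float(np.sum((q_history[i] - q_history[i - 1]) ** 2))
                    for i in range(1, len(q_history)))
    eta = c / np.sqrt(1.0 + variation)
    # Follow-the-regularized-leader style step on cumulative values plus the optimistic term.
    logits = eta * (np.sum(q_history, axis=0) + m_next)
    logits -= logits.max()  # numerical stability before exponentiation
    pi = np.exp(logits)
    return pi / pi.sum()

# Toy usage: 3 actions, a few rounds of noisy value estimates.
rng = np.random.default_rng(0)
q_hist = []
for t in range(5):
    q_hist.append(np.array([1.0, 0.5, 0.2]) + 0.1 * rng.standard_normal(3))
    print(optimistic_weights_update(q_hist))

In this sketch the policy concentrates on higher-valued actions more quickly when successive value estimates are stable (small variation, hence larger eta) and more cautiously when they fluctuate, which is the intuition behind coupling an adaptive learning rate with optimistic loss predictions.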