Paper Title

Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning

Paper Authors

Zihan Zhang, Yuhang Jiang, Yuan Zhou, Xiangyang Ji

Paper Abstract

In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches. The multi-batch reinforcement learning framework, in which the agent is required to provide a time schedule for policy updates before learning begins, is particularly suitable for scenarios where the agent incurs a large cost for changing the policy adaptively. Given a finite-horizon MDP with $S$ states, $A$ actions and planning horizon $H$, we design a computationally efficient algorithm that achieves near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$\footnote{$\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.} in $K$ episodes using $O\left(H+\log_2\log_2(K)\right)$ batches with confidence parameter $\delta$. To the best of our knowledge, this is the first $\tilde{O}(\sqrt{SAH^3K})$ regret bound achieved with $O(H+\log_2\log_2(K))$ batch complexity. Meanwhile, we show that to achieve $\tilde{O}(\mathrm{poly}(S,A,H)\sqrt{K})$ regret, the number of batches must be at least $\Omega\left(H/\log_A(K)+\log_2\log_2(K)\right)$, which matches our upper bound up to logarithmic terms. Our technical contributions are two-fold: 1) a near-optimal design scheme to explore unlearned states; 2) a computationally efficient algorithm to explore certain directions with an approximated transition model.
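The $O(H+\log_2\log_2(K))$ batch complexity pairs an $O(H)$ term with a $\log_2\log_2(K)$ term. The latter is characteristic of a doubling-exponent grid of batch end-points, the standard device in batched bandits for obtaining $\sqrt{K}$-type regret with very few policy switches. The abstract does not spell out the paper's actual schedule, so the sketch below is only an illustration of how such a grid is built under that assumption; the function name `doubling_exponent_grid` is hypothetical.

```python
import math

def doubling_exponent_grid(K: int) -> list[int]:
    """Illustrative batch end-points t_i = K**(1 - 2**-i).

    Hypothetical sketch, not the paper's actual schedule. Within batch i
    the policy is frozen; per-batch regret then scales like
    t_{i+1} / sqrt(t_i) = K**(1 - 2**-(i+1)) / K**((1 - 2**-i) / 2),
    whose exponent is exactly 1/2 for every i, so each of the
    ~log2 log2 K batches contributes O(sqrt(K)) regret.
    """
    assert K >= 2
    # Number of grid points needed before t_i reaches K: ~log2 log2 K.
    n = max(1, math.ceil(math.log2(max(2.0, math.log2(K)))))
    ends = {min(K, int(K ** (1 - 2.0 ** (-i)))) for i in range(1, n + 1)}
    ends.add(K)  # the final batch always ends at episode K
    return sorted(ends)

if __name__ == "__main__":
    # For K = 10**6 this yields ~6 end-points, from sqrt(K) = 1000 up to K.
    print(doubling_exponent_grid(10**6))
```

The $O(H)$ part of the bound would correspond to horizon-dependent exploration stages (the "design scheme to explore unlearned states" mentioned above); its exact batching is a detail of the paper beyond the abstract.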
