Paper Title

Horizon-Free and Variance-Dependent Reinforcement Learning for Latent Markov Decision Processes

Paper Authors

Runlong Zhou, Ruosong Wang, Simon S. Du

Paper Abstract

We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight. We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver. We prove an $\tilde{O}(\sqrt{\mathsf{Var}^\star M \Gamma S A K})$ regret bound, where $\tilde{O}$ hides logarithmic factors, $M$ is the number of contexts, $S$ is the number of states, $A$ is the number of actions, $K$ is the number of episodes, $\Gamma \le S$ is the maximum transition degree of any state-action pair, and $\mathsf{Var}^\star$ is a variance quantity describing the determinism of the LMDP. The regret bound scales only logarithmically with the planning horizon, thus yielding the first (nearly) horizon-free regret bound for LMDPs. This is also the first problem-dependent regret bound for LMDPs. Key to our proof is an analysis of the total variance of alpha vectors (a generalization of value functions), which is handled with a truncation method. We complement our positive result with a novel $\Omega(\sqrt{\mathsf{Var}^\star M S A K})$ regret lower bound with $\Gamma = 2$, which shows that our upper bound is minimax optimal when $\Gamma$ is a constant for the class of variance-bounded LMDPs. Our lower bound relies on new constructions of hard instances and an argument inspired by the symmetrization technique from theoretical computer science, both of which are technically different from existing lower bound proofs for MDPs, and thus can be of independent interest.
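To make the horizon-free claim concrete, below is a minimal illustrative sketch (not from the paper) that evaluates the leading term of the stated bound, $\sqrt{\mathsf{Var}^\star M \Gamma S A K}$, next to a generic horizon-dependent baseline of the form $H\sqrt{MSAK}$. All parameter values, including the placeholder var_star, are hypothetical and chosen only to show that the former does not grow with the planning horizon $H$.

```python
import math

def horizon_free_regret(var_star, M, Gamma, S, A, K):
    # Leading term of the abstract's bound, with log factors ignored.
    return math.sqrt(var_star * M * Gamma * S * A * K)

def horizon_dependent_regret(H, M, S, A, K):
    # Leading term of a generic H-scaling baseline, shown for comparison only;
    # this is not a bound claimed by the paper.
    return H * math.sqrt(M * S * A * K)

if __name__ == "__main__":
    # Hypothetical problem sizes; Var* is a total-variance proxy bounded independently of H.
    M, S, A, K, Gamma, var_star = 3, 10, 4, 100_000, 2, 1.0
    for H in (10, 100, 1000):
        hf = horizon_free_regret(var_star, M, Gamma, S, A, K)
        hd = horizon_dependent_regret(H, M, S, A, K)
        print(f"H={H:5d}  horizon-free ~ {hf:10.1f}   H-scaling ~ {hd:12.1f}")
```

Under these placeholder values the horizon-free term stays fixed as $H$ grows, while the $H$-scaling baseline increases linearly, which is the sense in which the bound is (nearly) horizon-free.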
