Paper Title


Reward-Mixing MDPs with a Few Latent Contexts are Learnable

Paper Authors

Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, Shie Mannor

Paper Abstract


We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs): at the beginning of every episode, nature randomly picks a latent reward model among $M$ candidates, and the agent interacts with the MDP throughout the episode for $H$ time steps. Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative reward in such a model. Previous work established an upper bound for RMMDPs with $M=2$. In this work, we resolve several questions that remained open for the RMMDP model. For an arbitrary $M \ge 2$, we provide a sample-efficient algorithm, $\texttt{EM}^2$, that outputs an $\epsilon$-optimal policy using $\tilde{O}\left(\epsilon^{-2} \cdot S^d A^d \cdot \texttt{poly}(H, Z)^d\right)$ episodes, where $S, A$ are the numbers of states and actions respectively, $H$ is the time horizon, $Z$ is the support size of the reward distributions, and $d = \min(2M-1, H)$. Our technique is a higher-order extension of the method-of-moments based approach; nevertheless, the design and analysis of the $\texttt{EM}^2$ algorithm require several new ideas beyond existing techniques. We also provide a lower bound of $(SA)^{\Omega(\sqrt{M})} / \epsilon^{2}$ for a general instance of RMMDP, showing that a sample complexity super-polynomial in $M$ is necessary.
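
For concreteness, below is a minimal sketch of the RMMDP interaction protocol described in the abstract: the latent reward model is drawn once at the start of each episode and never revealed to the agent, the transition kernel is shared across all contexts, and only the reward distribution depends on the latent context. The class name, the uniform mixing weights, and the Bernoulli rewards are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

class RewardMixingMDP:
    """Sketch of an episodic reward-mixing MDP with M latent reward models."""

    def __init__(self, S, A, H, M, seed=0):
        rng = np.random.default_rng(seed)
        self.S, self.A, self.H, self.M = S, A, H, M
        # Transition kernel P(s' | s, a) is shared across all latent contexts.
        self.P = rng.dirichlet(np.ones(S), size=(S, A))
        # Each latent context m has its own Bernoulli reward means R_m(s, a).
        self.R = rng.random((M, S, A))
        self.rng = rng

    def run_episode(self, policy):
        """Roll out one H-step episode; the latent context m stays hidden."""
        m = self.rng.integers(self.M)          # nature picks a latent reward model
        s, total = 0, 0.0
        for h in range(self.H):
            a = policy(h, s)                   # agent observes only states/actions/rewards
            total += self.rng.binomial(1, self.R[m, s, a])
            s = self.rng.choice(self.S, p=self.P[s, a])
        return total

# Example: average return of an arbitrary fixed policy over many episodes.
env = RewardMixingMDP(S=5, A=2, H=10, M=3)
avg = np.mean([env.run_episode(lambda h, s: s % 2) for _ in range(1000)])
print(f"average return: {avg:.2f}")
```

Because the context is resampled every episode and unobserved, the agent must optimize the reward averaged over the $M$ mixture components; this is what makes the problem harder than a standard MDP and why the bounds above scale with $d = \min(2M-1, H)$.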
