Paper Title


Reward-Mixing MDPs with a Few Latent Contexts are Learnable

Paper Authors

Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, Shie Mannor

Paper Abstract


We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs): at the beginning of every episode, nature randomly picks a latent reward model among $M$ candidates, and the agent interacts with the MDP throughout the episode for $H$ time steps. Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative reward in such a model. Previous work established an upper bound for RMMDPs with $M=2$. In this work, we resolve several questions that remained open for the RMMDP model. For an arbitrary $M \ge 2$, we provide a sample-efficient algorithm, $\texttt{EM}^2$, that outputs an $\epsilon$-optimal policy using $\tilde{O}\left(\epsilon^{-2} \cdot S^d A^d \cdot \texttt{poly}(H, Z)^d\right)$ episodes, where $S, A$ are the numbers of states and actions respectively, $H$ is the time horizon, $Z$ is the support size of the reward distributions, and $d = \min(2M-1, H)$. Our technique is a higher-order extension of the method-of-moments based approach; nevertheless, the design and analysis of the $\texttt{EM}^2$ algorithm require several new ideas beyond existing techniques. We also provide a lower bound of $(SA)^{\Omega(\sqrt{M})} / \epsilon^{2}$ for a general instance of RMMDP, showing that a sample complexity super-polynomial in $M$ is necessary.
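
For concreteness, below is a minimal sketch of the RMMDP interaction protocol described in the abstract: the latent reward model is drawn once at the start of each episode and never revealed to the agent, the transition kernel is shared across all contexts, and only the reward distribution depends on the latent context. The class name, the uniform mixing weights, and the Bernoulli rewards are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

class RewardMixingMDP:
    """Sketch of an episodic reward-mixing MDP with M latent reward models."""

    def __init__(self, S, A, H, M, seed=0):
        rng = np.random.default_rng(seed)
        self.S, self.A, self.H, self.M = S, A, H, M
        # Transition kernel P(s' | s, a) is shared across all latent contexts.
        self.P = rng.dirichlet(np.ones(S), size=(S, A))
        # Each latent context m has its own Bernoulli reward means R_m(s, a).
        self.R = rng.random((M, S, A))
        self.rng = rng

    def run_episode(self, policy):
        """Roll out one H-step episode; the latent context m stays hidden."""
        m = self.rng.integers(self.M)          # nature picks a latent reward model
        s, total = 0, 0.0
        for h in range(self.H):
            a = policy(h, s)                   # agent observes only states/actions/rewards
            total += self.rng.binomial(1, self.R[m, s, a])
            s = self.rng.choice(self.S, p=self.P[s, a])
        return total

# Example: average return of an arbitrary fixed policy over many episodes.
env = RewardMixingMDP(S=5, A=2, H=10, M=3)
avg = np.mean([env.run_episode(lambda h, s: s % 2) for _ in range(1000)])
print(f"average return: {avg:.2f}")
```

Because the context is resampled every episode and unobserved, the agent must optimize the reward averaged over the $M$ mixture components; this is what makes the problem harder than a standard MDP and why the bounds above scale with $d = \min(2M-1, H)$.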
