Title

Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs

Authors

Jiafan He, Dongruo Zhou, Quanquan Gu

Abstract

We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-$γ$, which is based on the \emph{optimism in the face of uncertainty principle} and the Bernstein-type bonus. We show that UCBVI-$γ$ achieves an $\tilde{O}\big({\sqrt{SAT}}/{(1-γ)^{1.5}}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $γ$ is the discount factor and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\tildeΩ\big({\sqrt{SAT}}/{(1-γ)^{1.5}}\big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$γ$ is nearly minimax optimal for discounted MDPs.
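For context, the following is a minimal sketch of the regret notion behind the stated bounds, assuming the standard single-trajectory regret definition used in this line of work on discounted MDPs; the abstract itself does not spell out the definition, so the formulas below are an assumption rather than a quotation from the paper.

% A sketch, assuming the usual single-trajectory regret for discounted MDPs.
% Here \pi_t is the policy the algorithm executes at step t and s_t is the state it visits.
\[
  \mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \Big( V^{*}(s_t) - V^{\pi_t}(s_t) \Big),
  \qquad
  V^{\pi}(s) \;=\; \mathbb{E}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r(s_k, a_k) \;\middle|\; s_0 = s,\ \pi \right].
\]
Under this notion, the abstract's upper bound for UCBVI-$γ$ and the lower bound for the constructed hard MDPs match up to logarithmic factors, i.e. the regret is of order
\[
  \tilde{\Theta}\!\left( \frac{\sqrt{SAT}}{(1-\gamma)^{1.5}} \right).
\]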
