Title

Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs

Authors

Jiafan He, Dongruo Zhou, Quanquan Gu

Abstract

We study the reinforcement learning problem for discounted Markov Decision Processes (MDPs) under the tabular setting. We propose a model-based algorithm named UCBVI-$γ$, which is based on the \emph{optimism in the face of uncertainty principle} and the Bernstein-type bonus. We show that UCBVI-$γ$ achieves an $\tilde{O}\big({\sqrt{SAT}}/{(1-γ)^{1.5}}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $γ$ is the discount factor and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that for any algorithm, the expected regret is at least $\tildeΩ\big({\sqrt{SAT}}/{(1-γ)^{1.5}}\big)$. Our upper bound matches the minimax lower bound up to logarithmic factors, which suggests that UCBVI-$γ$ is nearly minimax optimal for discounted MDPs.
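For context, the following is a minimal sketch of the regret notion behind the stated bounds, assuming the standard single-trajectory regret definition used in this line of work on discounted MDPs; the abstract itself does not spell out the definition, so the formulas below are an assumption rather than a quotation from the paper.

% A sketch, assuming the usual single-trajectory regret for discounted MDPs.
% Here \pi_t is the policy the algorithm executes at step t and s_t is the state it visits.
\[
  \mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \Big( V^{*}(s_t) - V^{\pi_t}(s_t) \Big),
  \qquad
  V^{\pi}(s) \;=\; \mathbb{E}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r(s_k, a_k) \;\middle|\; s_0 = s,\ \pi \right].
\]
Under this notion, the abstract's upper bound for UCBVI-$γ$ and the lower bound for the constructed hard MDPs match up to logarithmic factors, i.e. the regret is of order
\[
  \tilde{\Theta}\!\left( \frac{\sqrt{SAT}}{(1-\gamma)^{1.5}} \right).
\]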
