Paper Title
Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation
Paper Authors
Paper Abstract
We study reinforcement learning with linear function approximation where the transition probability and reward functions are linear with respect to a feature mapping $\boldsymbol{\phi}(s,a)$. Specifically, we consider the episodic inhomogeneous linear Markov Decision Process (MDP), and propose a novel computationally efficient algorithm, LSVI-UCB$^+$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound, where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps. LSVI-UCB$^+$ builds on weighted ridge regression and upper confidence value iteration with a Bernstein-type exploration bonus. Our statistical results are obtained with novel analytical tools, including a new Bernstein self-normalized bound with conservatism on elliptical potentials, and a refined analysis of the correction term. This is a minimax optimal algorithm for linear MDPs up to logarithmic factors, which closes the $\sqrt{Hd}$ gap between the upper bound of $\widetilde{O}(\sqrt{H^3d^3T})$ in (Jin et al., 2020) and the lower bound of $\Omega(Hd\sqrt{T})$ for linear MDPs.
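To illustrate the two ingredients named in the abstract, the sketch below shows variance-weighted ridge regression over features $\boldsymbol{\phi}(s,a)$ together with an optimistic bonus proportional to the elliptical potential $\sqrt{\boldsymbol{\phi}^\top \Lambda^{-1} \boldsymbol{\phi}}$. This is a minimal sketch, not the authors' implementation of LSVI-UCB$^+$: the function names, how the regression targets and per-sample variance estimates are produced, and the confidence radius `beta` are all assumptions made for illustration.

```python
import numpy as np

def weighted_ridge_with_bonus(Phi, y, sigma2, lam=1.0, beta=1.0):
    """Illustrative sketch (hypothetical helper, not from the paper).

    Phi    : (n, d) array of features phi(s_i, a_i)
    y      : (n,) array of regression targets (e.g., reward plus estimated next-state value)
    sigma2 : (n,) array of per-sample variance estimates used as Bernstein-style weights
    lam    : ridge regularization parameter lambda
    beta   : confidence radius multiplier for the exploration bonus
    """
    n, d = Phi.shape
    # Variance-weighted Gram matrix: Lambda = sum_i phi_i phi_i^T / sigma2_i + lam * I
    W = Phi / sigma2[:, None]
    Lambda = W.T @ Phi + lam * np.eye(d)
    # Weighted ridge solution: w = Lambda^{-1} sum_i phi_i y_i / sigma2_i
    w = np.linalg.solve(Lambda, W.T @ y)

    def optimistic_value(phi):
        # Optimistic estimate: linear prediction plus elliptical bonus
        # beta * sqrt(phi^T Lambda^{-1} phi).
        bonus = beta * np.sqrt(phi @ np.linalg.solve(Lambda, phi))
        return phi @ w + bonus

    return w, optimistic_value
```

In an upper-confidence value-iteration scheme, an optimistic estimate of this form would be computed per step, with the variance weights driving the Bernstein-type bonus described in the abstract.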