Paper Title

Computationally Efficient Horizon-Free Reinforcement Learning for Linear Mixture MDPs

Authors

Dongruo Zhou, Quanquan Gu

Abstract

Recent studies have shown that episodic reinforcement learning (RL) is not more difficult than contextual bandits, even with a long planning horizon and unknown state transitions. However, these results are limited to either tabular Markov decision processes (MDPs) or computationally inefficient algorithms for linear mixture MDPs. In this paper, we propose the first computationally efficient horizon-free algorithm for linear mixture MDPs, which achieves the optimal $\tilde O(d\sqrt{K} + d^2)$ regret up to logarithmic factors. Our algorithm adapts a weighted least squares estimator for the unknown transition dynamics, where the weight is both \emph{variance-aware} and \emph{uncertainty-aware}. When applying our weighted least squares estimator to heterogeneous linear bandits, we can obtain an $\tilde O(d\sqrt{\sum_{k=1}^K \sigma_k^2} + d)$ regret in the first $K$ rounds, where $d$ is the dimension of the context and $\sigma_k^2$ is the variance of the reward in the $k$-th round. This also improves upon the best-known algorithms in this setting when the $\sigma_k^2$'s are known.
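For context, the weighted least squares (ridge) estimator referenced in the abstract has the generic form below; this is a standard weighted ridge regression sketch, not a reproduction of the paper's algorithm. Here $x_i$ and $y_i$ denote the regression features and targets, $\lambda$ is the ridge parameter, and $\bar\sigma_i^2$ is the per-sample weight, which the paper designs to be both variance-aware and uncertainty-aware (the specific choice of $\bar\sigma_i^2$ is the paper's contribution and is not given here):

$$
\hat{\theta}_k \;=\; \operatorname*{argmin}_{\theta}\; \lambda\|\theta\|_2^2 \;+\; \sum_{i=1}^{k-1} \frac{\big(\langle x_i,\theta\rangle - y_i\big)^2}{\bar\sigma_i^2}
\;=\; \Big(\lambda I \;+\; \sum_{i=1}^{k-1} \bar\sigma_i^{-2}\, x_i x_i^\top\Big)^{-1} \sum_{i=1}^{k-1} \bar\sigma_i^{-2}\, x_i\, y_i .
$$

Down-weighting each sample by (a suitable proxy for) its noise variance is what yields the $\tilde O(d\sqrt{\sum_{k=1}^K \sigma_k^2} + d)$ regret in the heterogeneous linear bandit setting described above.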
