Paper Title
Near-Optimal Deployment Efficiency in Reward-Free Reinforcement Learning with Linear Function Approximation
Paper Authors
Paper Abstract
We study the problem of deployment-efficient reinforcement learning (RL) with linear function approximation under the \emph{reward-free} exploration setting. This is a well-motivated problem because deploying new policies is costly in real-life RL applications. Under the linear MDP setting with feature dimension $d$ and planning horizon $H$, we propose a new algorithm that collects at most $\widetilde{O}(\frac{d^2H^5}{\epsilon^2})$ trajectories within $H$ deployments to identify an $\epsilon$-optimal policy for any (possibly data-dependent) choice of reward functions. To the best of our knowledge, our approach is the first to achieve optimal deployment complexity and optimal $d$ dependence in sample complexity at the same time, even if the reward is known ahead of time. Our novel techniques include an exploration-preserving policy discretization and a generalized G-optimal experiment design, which could be of independent interest. Lastly, we analyze the related problem of regret minimization in low-adaptive RL and provide information-theoretic lower bounds for switching cost and batch complexity.
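The abstract mentions a generalized G-optimal experiment design as one of the key techniques. As an illustrative aside (not the paper's actual algorithm), the sketch below shows the classical G-optimal design over a finite set of feature vectors, computed with a Frank-Wolfe / Fedorov-Wynn style update in NumPy; by the Kiefer-Wolfowitz theorem, the optimal value of the worst-case weighted norm is exactly the feature dimension $d$. All function and variable names here are hypothetical and chosen only for this sketch.

```python
import numpy as np

def g_optimal_design(X, n_iters=1000):
    """Illustrative Frank-Wolfe sketch of a G-optimal design over the rows of X.

    X: (K, d) array whose rows are candidate feature vectors.
    Returns weights pi (length K) approximately minimizing
        max_k  x_k^T (sum_j pi_j x_j x_j^T)^{-1} x_k,
    whose optimal value equals d (Kiefer-Wolfowitz theorem).
    """
    K, d = X.shape
    pi = np.full(K, 1.0 / K)                       # start from the uniform design
    for _ in range(n_iters):
        A = X.T @ (pi[:, None] * X)                # design matrix  sum_k pi_k x_k x_k^T
        A_inv = np.linalg.pinv(A)
        g = np.einsum('ki,ij,kj->k', X, A_inv, X)  # g_k = x_k^T A^{-1} x_k
        k_star = int(np.argmax(g))
        # exact line-search step of the Fedorov-Wynn update
        gamma = max((g[k_star] / d - 1.0) / (g[k_star] - 1.0), 0.0)
        pi = (1.0 - gamma) * pi
        pi[k_star] += gamma
    return pi

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 4))                   # 50 candidate features in dimension d = 4
    pi = g_optimal_design(X)
    A_inv = np.linalg.pinv(X.T @ (pi[:, None] * X))
    worst_case = np.max(np.einsum('ki,ij,kj->k', X, A_inv, X))
    print(worst_case)                              # should approach d = 4
```

In the reward-free setting considered by the paper, such a design is generalized so that the "candidates" are (discretized) exploration policies rather than fixed feature vectors; the sketch above only conveys the underlying optimality criterion.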