Paper Title
Near-Optimal Deployment Efficiency in Reward-Free Reinforcement Learning with Linear Function Approximation
Paper Authors
Paper Abstract
We study the problem of deployment-efficient reinforcement learning (RL) with linear function approximation under the \emph{reward-free} exploration setting. This is a well-motivated problem because deploying new policies is costly in real-life RL applications. Under the linear MDP setting with feature dimension $d$ and planning horizon $H$, we propose a new algorithm that collects at most $\widetilde{O}(\frac{d^2H^5}{\epsilon^2})$ trajectories within $H$ deployments to identify an $\epsilon$-optimal policy for any (possibly data-dependent) choice of reward functions. To the best of our knowledge, our approach is the first to achieve optimal deployment complexity and optimal $d$ dependence in sample complexity at the same time, even if the reward is known ahead of time. Our novel techniques include an exploration-preserving policy discretization and a generalized G-optimal experiment design, which could be of independent interest. Lastly, we analyze the related problem of regret minimization in low-adaptive RL and provide information-theoretic lower bounds for switching cost and batch complexity.
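The abstract mentions a generalized G-optimal experiment design as one of the key techniques. As an illustrative aside (not the paper's actual algorithm), the sketch below shows the classical G-optimal design over a finite set of feature vectors, computed with a Frank-Wolfe / Fedorov-Wynn style update in NumPy; by the Kiefer-Wolfowitz theorem, the optimal value of the worst-case weighted norm is exactly the feature dimension $d$. All function and variable names here are hypothetical and chosen only for this sketch.

```python
import numpy as np

def g_optimal_design(X, n_iters=1000):
    """Illustrative Frank-Wolfe sketch of a G-optimal design over the rows of X.

    X: (K, d) array whose rows are candidate feature vectors.
    Returns weights pi (length K) approximately minimizing
        max_k  x_k^T (sum_j pi_j x_j x_j^T)^{-1} x_k,
    whose optimal value equals d (Kiefer-Wolfowitz theorem).
    """
    K, d = X.shape
    pi = np.full(K, 1.0 / K)                       # start from the uniform design
    for _ in range(n_iters):
        A = X.T @ (pi[:, None] * X)                # design matrix  sum_k pi_k x_k x_k^T
        A_inv = np.linalg.pinv(A)
        g = np.einsum('ki,ij,kj->k', X, A_inv, X)  # g_k = x_k^T A^{-1} x_k
        k_star = int(np.argmax(g))
        # exact line-search step of the Fedorov-Wynn update
        gamma = max((g[k_star] / d - 1.0) / (g[k_star] - 1.0), 0.0)
        pi = (1.0 - gamma) * pi
        pi[k_star] += gamma
    return pi

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 4))                   # 50 candidate features in dimension d = 4
    pi = g_optimal_design(X)
    A_inv = np.linalg.pinv(X.T @ (pi[:, None] * X))
    worst_case = np.max(np.einsum('ki,ij,kj->k', X, A_inv, X))
    print(worst_case)                              # should approach d = 4
```

In the reward-free setting considered by the paper, such a design is generalized so that the "candidates" are (discretized) exploration policies rather than fixed feature vectors; the sketch above only conveys the underlying optimality criterion.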