Paper Title
Is Plug-in Solver Sample-Efficient for Feature-based Reinforcement Learning?
Paper Authors
Paper Abstract
It is believed that a model-based approach for reinforcement learning (RL) is the key to reducing sample complexity. However, the understanding of the sample optimality of model-based RL is still largely missing, even for the linear case. This work considers the sample complexity of finding an $ε$-optimal policy in a Markov decision process (MDP) that admits a linear additive feature representation, given only access to a generative model. We solve this problem via a plug-in solver approach, which builds an empirical model and plans in this empirical model via an arbitrary plug-in solver. We prove that under the anchor-state assumption, which implies implicit non-negativity in the feature space, the minimax sample complexity of finding an $ε$-optimal policy in a $γ$-discounted MDP is $O(K/((1-γ)^3ε^2))$, which depends only on the dimensionality $K$ of the feature space and has no dependence on the state or action space. We further extend our results to a relaxed setting where anchor states may not exist and show that a plug-in approach can be sample-efficient as well, providing a flexible way to design model-based algorithms for RL.
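To make the plug-in solver idea concrete, below is a minimal sketch (not taken from the paper) for the tabular case, assuming access to a hypothetical generative model `sample_next_state(s, a)` that returns one draw from $P(\cdot \mid s, a)$ and a known reward matrix. The paper's feature-based setting would instead estimate a $K$-dimensional feature model, but the plug-in structure is the same: build an empirical model from generative-model samples, then plan in it with an arbitrary solver (here, value iteration).

```python
import numpy as np

def plug_in_solver(sample_next_state, reward, num_states, num_actions,
                   gamma=0.9, samples_per_pair=1000, tol=1e-6):
    """Illustrative plug-in solver for a tabular MDP with a generative model.

    Assumptions (not from the paper): `sample_next_state(s, a)` returns an
    integer next-state index drawn from P(. | s, a), and `reward` is a known
    (num_states, num_actions) array.
    """
    # Step 1: build the empirical transition model P_hat from samples.
    P_hat = np.zeros((num_states, num_actions, num_states))
    for s in range(num_states):
        for a in range(num_actions):
            for _ in range(samples_per_pair):
                s_next = sample_next_state(s, a)   # one generative-model query
                P_hat[s, a, s_next] += 1.0
            P_hat[s, a] /= samples_per_pair

    # Step 2: plan in the empirical MDP with any planner; value iteration here.
    V = np.zeros(num_states)
    while True:
        Q = reward + gamma * (P_hat @ V)           # shape (num_states, num_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new

    policy = Q.argmax(axis=1)                      # greedy policy in the empirical MDP
    return policy, V
```

In this tabular sketch the sampling cost scales with the number of state-action pairs; the abstract's point is that under the feature-based (anchor-state) assumption the estimation effort instead scales with the feature dimension $K$, giving the $O(K/((1-γ)^3ε^2))$ bound independent of the sizes of the state and action spaces.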