Paper Title
Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model
Paper Authors
Paper Abstract
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator). We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$. Despite a number of prior works tackling this problem, a complete picture of the trade-offs between sample complexity and statistical accuracy is yet to be determined. In particular, all prior results suffer from a severe sample size barrier, in the sense that their claimed statistical guarantees hold only when the sample size exceeds at least $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$. The current paper overcomes this barrier by certifying the minimax optimality of two algorithms -- a perturbed model-based algorithm and a conservative model-based algorithm -- as soon as the sample size exceeds the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ (modulo some log factor). Moving beyond infinite-horizon MDPs, we further study time-inhomogeneous finite-horizon MDPs, and prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level. To the best of our knowledge, this work delivers the first minimax-optimal guarantees that accommodate the entire range of sample sizes (beyond which finding a meaningful policy is information-theoretically infeasible).
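To make the model-based approach concrete, below is a minimal sketch (not the paper's code) of plain model-based planning with a generative model: query the simulator $N$ times per state-action pair, build the empirical transition kernel $\widehat{P}$, and run value iteration in the resulting empirical MDP. The simulator interface `sample_next_state`, the tabular layout, and all names here are illustrative assumptions; as we understand it, the paper's perturbed variant additionally adds a small random perturbation to the rewards before planning, which is omitted in this sketch.

```python
import numpy as np

def empirical_model(sample_next_state, S, A, N):
    """Estimate the transition kernel by querying the generative model
    N times for each (state, action) pair."""
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(N):
                P_hat[s, a, sample_next_state(s, a)] += 1.0
    return P_hat / N  # empirical transition probabilities

def plan(P_hat, rewards, gamma, iters=1000):
    """Value iteration in the empirical MDP, then greedy policy extraction."""
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = rewards + gamma * (P_hat @ V)  # Q[s, a] = r(s, a) + gamma * E_hat[V(s')]
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V  # greedy policy and its value estimate

# Toy usage on a randomly generated 3-state, 2-action MDP.
rng = np.random.default_rng(0)
P_true = rng.dirichlet(np.ones(3), size=(3, 2))         # true kernel, shape (S, A, S)
rewards = rng.random((3, 2))
simulator = lambda s, a: rng.choice(3, p=P_true[s, a])  # generative-model oracle
P_hat = empirical_model(simulator, S=3, A=2, N=500)
policy, V_hat = plan(P_hat, rewards, gamma=0.9)
```

The paper's question is how large $N \cdot |\mathcal{S}||\mathcal{A}|$ must be for the policy returned by such a procedure to be near-optimal; its results certify minimax optimality down to a total sample size on the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$, rather than the $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$ required by prior analyses.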