Paper Title
Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model
Paper Authors
Paper Abstract
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator). We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$. Despite a number of prior works tackling this problem, a complete picture of the trade-offs between sample complexity and statistical accuracy is yet to be determined. In particular, all prior results suffer from a severe sample size barrier, in the sense that their claimed statistical guarantees hold only when the sample size exceeds at least $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$. The current paper overcomes this barrier by certifying the minimax optimality of two algorithms -- a perturbed model-based algorithm and a conservative model-based algorithm -- as soon as the sample size exceeds the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$ (modulo some log factor). Moving beyond infinite-horizon MDPs, we further study time-inhomogeneous finite-horizon MDPs, and prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level. To the best of our knowledge, this work delivers the first minimax-optimal guarantees that accommodate the entire range of sample sizes (beyond which finding a meaningful policy is information-theoretically infeasible).
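To make the model-based approach concrete, below is a minimal sketch (not the paper's code) of plain model-based planning with a generative model: query the simulator $N$ times per state-action pair, build the empirical transition kernel $\widehat{P}$, and run value iteration in the resulting empirical MDP. The simulator interface `sample_next_state`, the tabular layout, and all names here are illustrative assumptions; as we understand it, the paper's perturbed variant additionally adds a small random perturbation to the rewards before planning, which is omitted in this sketch.

```python
import numpy as np

def empirical_model(sample_next_state, S, A, N):
    """Estimate the transition kernel by querying the generative model
    N times for each (state, action) pair."""
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(N):
                P_hat[s, a, sample_next_state(s, a)] += 1.0
    return P_hat / N  # empirical transition probabilities

def plan(P_hat, rewards, gamma, iters=1000):
    """Value iteration in the empirical MDP, then greedy policy extraction."""
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = rewards + gamma * (P_hat @ V)  # Q[s, a] = r(s, a) + gamma * E_hat[V(s')]
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V  # greedy policy and its value estimate

# Toy usage on a randomly generated 3-state, 2-action MDP.
rng = np.random.default_rng(0)
P_true = rng.dirichlet(np.ones(3), size=(3, 2))         # true kernel, shape (S, A, S)
rewards = rng.random((3, 2))
simulator = lambda s, a: rng.choice(3, p=P_true[s, a])  # generative-model oracle
P_hat = empirical_model(simulator, S=3, A=2, N=500)
policy, V_hat = plan(P_hat, rewards, gamma=0.9)
```

The paper's question is how large $N \cdot |\mathcal{S}||\mathcal{A}|$ must be for the policy returned by such a procedure to be near-optimal; its results certify minimax optimality down to a total sample size on the order of $\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}$, rather than the $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^2}$ required by prior analyses.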