Paper Title

Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via Online Experiment Design

Authors

Andrew Wagenmaker, Kevin Jamieson

Abstract

While much progress has been made in understanding the minimax sample complexity of reinforcement learning (RL) -- the complexity of learning on the "worst-case" instance -- such measures of complexity often do not capture the true difficulty of learning. In practice, on an "easy" instance, we might hope to achieve a complexity far better than that achievable on the worst-case instance. In this work we seek to understand the "instance-dependent" complexity of learning near-optimal policies (PAC RL) in the setting of RL with linear function approximation. We propose an algorithm, \textsc{Pedel}, which achieves a fine-grained instance-dependent measure of complexity, the first of its kind in the RL with function approximation setting, thereby capturing the difficulty of learning on each particular problem instance. Through an explicit example, we show that \textsc{Pedel} yields provable gains over low-regret, minimax-optimal algorithms and that such algorithms are unable to hit the instance-optimal rate. Our approach relies on a novel online experiment design-based procedure which focuses the exploration budget on the "directions" most relevant to learning a near-optimal policy, and may be of independent interest.
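The abstract's key primitive is experiment design over feature "directions": allocate exploration so that the directions most relevant to distinguishing policies are well covered. Below is a minimal, illustrative sketch of that primitive only, not the paper's \textsc{Pedel} algorithm: a classic Frank-Wolfe approximation to a G-optimal design over a finite set of feature vectors. The function name `g_optimal_design`, the step-size schedule, and the use of NumPy are assumptions made for illustration.

```python
# Illustrative sketch (assumed, not from the paper): Frank-Wolfe iterations
# for an approximate G-optimal design over a finite set of feature vectors.
# In a linear MDP, an analogous design over state-action features concentrates
# exploration on the directions that most reduce value uncertainty.
import numpy as np

def g_optimal_design(Phi, n_iters=500, reg=1e-6):
    """Approximate a G-optimal design distribution over the rows of Phi.

    Phi : (K, d) array of feature vectors, one per candidate direction.
    Returns a length-K probability vector `lam` whose design covariance
    A = sum_k lam[k] * phi_k phi_k^T keeps the worst-case leverage
    max_k phi_k^T A^{-1} phi_k small (it approaches d at the optimum).
    """
    K, d = Phi.shape
    lam = np.full(K, 1.0 / K)               # start from the uniform design
    for t in range(n_iters):
        A = Phi.T @ (lam[:, None] * Phi) + reg * np.eye(d)
        A_inv = np.linalg.inv(A)
        # Leverage scores: how poorly each direction is currently covered.
        scores = np.einsum("kd,de,ke->k", Phi, A_inv, Phi)
        k_star = int(np.argmax(scores))     # most under-explored direction
        step = 2.0 / (t + 3.0)              # standard Frank-Wolfe step size
        lam = (1.0 - step) * lam
        lam[k_star] += step                 # shift mass toward that direction
    return lam

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Phi = rng.normal(size=(50, 5))          # 50 candidate directions in R^5
    lam = g_optimal_design(Phi)
    A = Phi.T @ (lam[:, None] * Phi)
    worst = max(phi @ np.linalg.solve(A, phi) for phi in Phi)
    print(f"worst-case leverage ~ {worst:.2f} (G-optimal value is d = 5)")
```

The paper's contribution is an online, regret-aware version of this idea: the design must be realized by executing policies in the MDP rather than by sampling vectors directly, and the relevant directions are those that discriminate near-optimal policies from suboptimal ones.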
