Paper Title

Stabilizing Q-learning with Linear Architectures for Provably Efficient Learning

Authors

Andrea Zanette, Martin J. Wainwright

Abstract

The $Q$-learning algorithm is a simple and widely-used stochastic approximation scheme for reinforcement learning, but the basic protocol can exhibit instability in conjunction with function approximation. Such instability can be observed even with linear function approximation. In practice, tools such as target networks and experience replay appear to be essential, but the individual contribution of each of these mechanisms is not well understood theoretically. This work proposes an exploration variant of the basic $Q$-learning protocol with linear function approximation. Our modular analysis illustrates the role played by each algorithmic tool that we adopt: a second-order update rule, a set of target networks, and a mechanism akin to experience replay. Together, they enable state-of-the-art regret bounds on linear MDPs while preserving the most prominent feature of the algorithm, namely a space complexity independent of the number of steps elapsed. We show that the performance of the algorithm degrades very gracefully under a novel and more permissive notion of approximation error. The algorithm also exhibits a form of instance-dependence, in that its performance depends on the "effective" feature dimension.
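
To make the three ingredients named in the abstract concrete, below is a minimal, self-contained sketch of $Q$-learning with linear function approximation that combines a second-order (covariance-weighted) update, a frozen target parameter vector, and a small replay buffer. This is an illustrative assumption, not the paper's actual algorithm: the toy MDP, feature map, and all hyperparameters are invented for the example.

```python
import numpy as np

# Illustrative sketch only: Q-learning with linear function approximation,
# a frozen "target network" parameter vector, a replay buffer, and a
# second-order (ridge least-squares) update.  Not the paper's algorithm.

rng = np.random.default_rng(0)
n_states, n_actions, d = 5, 2, 4      # toy MDP sizes and feature dimension
gamma, lam = 0.9, 1.0                 # discount factor, ridge regularizer

# Random feature map phi(s, a) in R^d (assumed; linear MDPs posit such features).
phi = rng.normal(size=(n_states, n_actions, d))

# Toy MDP transition kernel and rewards (assumed for illustration only).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(size=(n_states, n_actions))

w = np.zeros(d)          # online parameters defining Q(s, a) = phi(s, a) @ w
w_target = np.zeros(d)   # frozen target parameters used in regression targets
A = lam * np.eye(d)      # accumulated feature covariance (second-order info)
b = np.zeros(d)
replay = []              # replay buffer of (s, a, r, s') transitions

s = 0
for t in range(1, 5001):
    # epsilon-greedy action selection from the online parameters
    if rng.random() < 0.1:
        a = int(rng.integers(n_actions))
    else:
        a = int(np.argmax(phi[s] @ w))
    s_next = rng.choice(n_states, p=P[s, a])
    replay.append((s, a, R[s, a], s_next))

    # Replay a stored transition; the regression target uses the frozen
    # target parameters, which is what stabilizes the bootstrapped update.
    s0, a0, r0, s1 = replay[rng.integers(len(replay))]
    y = r0 + gamma * np.max(phi[s1] @ w_target)
    x = phi[s0, a0]
    A += np.outer(x, x)
    b += x * y
    w = np.linalg.solve(A, b)   # covariance-weighted (second-order) update

    if t % 200 == 0:            # periodically refresh the target parameters
        w_target = w.copy()
    s = s_next

print("learned Q-values:\n", (phi.reshape(-1, d) @ w).reshape(n_states, n_actions))
```

Note that the space used above is dominated by the $d \times d$ covariance matrix and the parameter vectors, which is the flavor of the "space complexity independent of the number of steps elapsed" property the abstract highlights; the explicit replay list is kept here only to make the replay mechanism visible.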
