Paper Title

When is Realizability Sufficient for Off-Policy Reinforcement Learning?

Paper Authors

Zanette, Andrea

Paper Abstract

Model-free algorithms for reinforcement learning typically require a condition called Bellman completeness in order to successfully operate off-policy with function approximation, unless additional conditions are met. However, Bellman completeness is a requirement that is much stronger than realizability and that is deemed too strong to hold in practice. In this work, we relax this structural assumption and analyze the statistical complexity of off-policy reinforcement learning when only realizability holds for the prescribed function class. We establish finite-sample guarantees for off-policy reinforcement learning that are free of the approximation error term known as the inherent Bellman error, and that depend on the interplay of three factors. The first two are well known: they are the metric entropy of the function class and the concentrability coefficient that represents the cost of learning off-policy. The third factor is new, and it measures the violation of Bellman completeness, namely the misalignment between the chosen function class and its image under the Bellman operator. In essence, these error bounds establish that off-policy reinforcement learning remains statistically viable even in the absence of Bellman completeness, and they characterize the intermediate situation between the favorable Bellman-complete setting and the worst-case scenario where exponential lower bounds are in force. Our analysis applies directly to the solution found by temporal difference algorithms when they converge.
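
As context, here is a minimal sketch of the standard definitions behind the quantities named in the abstract, stated for a policy-evaluation Bellman operator; the symbols $\mathcal{F}$, $Q^{\pi}$, $d^{\pi}$, $\mu$, and the specific norms are illustrative assumptions, and the paper's exact choices may differ.

$$(\mathcal{T}^{\pi} f)(s,a) \;=\; r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[f(s', \pi(s'))\big]$$

Realizability asks only that the target value function lie in the chosen class, $Q^{\pi} \in \mathcal{F}$. Bellman completeness is the stronger demand that the class be closed under the operator, $\mathcal{T}^{\pi}\mathcal{F} \subseteq \mathcal{F}$; its violation is commonly measured by the inherent Bellman error

$$\sup_{f \in \mathcal{F}} \; \inf_{g \in \mathcal{F}} \; \big\| g - \mathcal{T}^{\pi} f \big\|,$$

while the cost of learning off-policy is captured by a concentrability coefficient such as

$$C \;=\; \sup_{s,a} \, \frac{d^{\pi}(s,a)}{\mu(s,a)},$$

where $d^{\pi}$ is the state-action distribution induced by the target policy and $\mu$ is the distribution of the off-policy data.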
