Paper Title

Chaining Value Functions for Off-Policy Learning

Paper Authors

Simon Schmitt, John Shawe-Taylor, Hado van Hasselt

Paper Abstract

To accumulate knowledge and improve its policy of behaviour, a reinforcement learning agent can learn `off-policy' about policies that differ from the policy used to generate its experience. This is important to learn counterfactuals, or because the experience was generated out of its own control. However, off-policy learning is non-trivial, and standard reinforcement-learning algorithms can be unstable and divergent. In this paper we discuss a novel family of off-policy prediction algorithms which are convergent by construction. The idea is to first learn on-policy about the data-generating behaviour, and then bootstrap an off-policy value estimate on this on-policy estimate, thereby constructing a value estimate that is partially off-policy. This process can be repeated to build a chain of value functions, each time bootstrapping a new estimate on the previous estimate in the chain. Each step in the chain is stable and hence the complete algorithm is guaranteed to be stable. Under mild conditions this comes arbitrarily close to the off-policy TD solution when we increase the length of the chain. Hence it can compute the solution even in cases where off-policy TD diverges. We prove that the proposed scheme is convergent and corresponds to an iterative decomposition of the inverse key matrix. Furthermore it can be interpreted as estimating a novel objective -- that we call a `k-step expedition' -- of following the target policy for finitely many steps before continuing indefinitely with the behaviour policy. Empirically we evaluate the idea on challenging MDPs such as Baird's counter example and observe favourable results.
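
The chaining idea described in the abstract can be illustrated with a small sketch. The Python example below is not the authors' implementation; it is a minimal, illustrative construction that assumes a tiny tabular MDP with known dynamics (the quantities P, R, mu, pi and gamma are made-up placeholders). It computes the behaviour policy's value function as the first link, then builds the chain by repeatedly bootstrapping a new estimate on the previous link under the target policy, mirroring the "k-step expedition" structure: follow the target policy for k steps, then continue with the behaviour policy.

```python
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)

# Illustrative MDP: P[s, a] is a distribution over next states,
# R[s, a] is the expected reward. Both are random placeholders.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

mu = np.full((n_states, n_actions), 1.0 / n_actions)            # behaviour policy (uniform)
pi = np.eye(n_actions)[rng.integers(n_actions, size=n_states)]  # target policy (deterministic)

def expected_model(policy):
    """State-to-state transition matrix and expected reward under a policy."""
    P_pol = np.einsum('sa,sax->sx', policy, P)
    r_pol = np.einsum('sa,sa->s', policy, R)
    return P_pol, r_pol

# Link 0 of the chain: the on-policy value of the behaviour policy
# (solved exactly here for brevity; in practice it would be learned by TD).
P_mu, r_mu = expected_model(mu)
v = np.linalg.solve(np.eye(n_states) - gamma * P_mu, r_mu)

# Links 1..K: each new value function bootstraps on the *previous* link under
# the target policy, i.e. the value of following pi for k steps and then mu
# forever (the "k-step expedition"). No estimate ever bootstraps on itself,
# so every link is a stable evaluation problem.
P_pi, r_pi = expected_model(pi)
K = 100
for _ in range(K):
    v = r_pi + gamma * P_pi @ v

# As K grows, the chain approaches the off-policy TD solution v_pi.
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
print("chained estimate:", np.round(v, 3))
print("true v_pi:       ", np.round(v_pi, 3))
```

Because each link bootstraps only on the already-fixed previous link rather than on itself, every step of the construction is stable, and lengthening the chain brings the final estimate arbitrarily close to the off-policy TD solution, as claimed in the abstract.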
