Paper Title
A Generalized Bootstrap Target for Value-Learning, Efficiently Combining Value and Feature Predictions
Paper Authors
Paper Abstract
Estimating value functions is a core component of reinforcement learning algorithms. Temporal difference (TD) learning algorithms use bootstrapping, i.e. they update the value function toward a learning target using value estimates at subsequent time-steps. Alternatively, the value function can be updated toward a learning target constructed by separately predicting successor features (SF)--a policy-dependent model--and linearly combining them with instantaneous rewards. We focus on the bootstrapping targets used when estimating value functions, and propose a new backup target, the $\eta$-return mixture, which implicitly combines value-predictive knowledge (used by TD methods) with (successor) feature-predictive knowledge--with a parameter $\eta$ capturing how much to rely on each. We illustrate that incorporating predictive knowledge through an $\eta\gamma$-discounted SF model makes more efficient use of sampled experience, compared to either extreme, i.e. bootstrapping entirely on the value function estimate, or bootstrapping on the product of separately estimated successor features and instantaneous reward models. We empirically show that this approach leads to faster policy evaluation and better control performance, for tabular and nonlinear function approximations, indicating scalability and generality.
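As an illustration of the kind of decomposition the abstract describes (the paper's exact definition of the $\eta$-return mixture may differ in its details), suppose the reward is approximately linear in the features, $R_{t+1} \approx \phi(S_t)^{\top}\mathbf{w}$, with $\phi$, $\mathbf{w}$, and $\psi$ denoting the feature map, reward weights, and successor features used in standard SF constructions. Then for any $\eta \in [0,1]$ the value function can be rewritten as

$$V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{k=0}^{\infty} (\eta\gamma)^{k} R_{t+k+1} \;+\; (1-\eta)\,\gamma \sum_{k=0}^{\infty} (\eta\gamma)^{k}\, V^{\pi}(S_{t+k+1}) \;\middle|\; S_t = s\right],$$

which can be verified by checking that the coefficient of each $R_{t+m+1}$ sums to $\gamma^{m}$. Under the linear-reward assumption, the first sum equals $\psi^{\eta\gamma}(s)^{\top}\mathbf{w}$, an $\eta\gamma$-discounted successor-feature prediction, while the second term bootstraps on value estimates along the trajectory. Setting $\eta = 0$ recovers the one-step TD target $R_{t+1} + \gamma V^{\pi}(S_{t+1})$, and $\eta = 1$ recovers the purely SF-based value $\psi^{\gamma}(s)^{\top}\mathbf{w}$, matching the two extremes discussed in the abstract.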