Paper Title
META-Learning Eligibility Traces for More Sample Efficient Temporal Difference Learning
Paper Authors
Paper Abstract
Temporal-Difference (TD) learning is a standard and very successful reinforcement learning approach, at the core of both algorithms that learn the value of a given policy, as well as algorithms which learn how to improve policies. TD-learning with eligibility traces provides a way to do temporal credit assignment, i.e. decide which portion of a reward should be assigned to predecessor states that occurred at different previous times, controlled by a parameter $\lambda$. However, tuning this parameter can be time-consuming, and not tuning it can lead to inefficient learning. To improve the sample efficiency of TD-learning, we propose a meta-learning method for adjusting the eligibility trace parameter, in a state-dependent manner. The adaptation is achieved with the help of auxiliary learners that learn distributional information about the update targets online, incurring roughly the same computational complexity per step as the usual value learner. Our approach can be used both in on-policy and off-policy learning. We prove that, under some assumptions, the proposed method improves the overall quality of the update targets, by minimizing the overall target error. This method can be viewed as a plugin which can also be used to assist prediction with function approximation by meta-learning feature (observation)-based $\lambda$ online, or even in the control case to assist policy improvement. Our empirical evaluation demonstrates significant performance improvements, as well as improved robustness of the proposed algorithm to learning rate variation.
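For context, the sketch below illustrates the mechanism the abstract builds on: tabular TD($\lambda$) with accumulating eligibility traces, written so that $\lambda$ is looked up per state rather than fixed globally. It is a minimal illustration, not the authors' meta-learning algorithm; the environment interface (`reset()`, `step()` returning state, reward, and a done flag), the `policy` and `lam` callables, and all parameter names are assumptions for this example. The paper's method would correspond to adapting what `lam(s)` returns online from distributional information about the update targets.

```python
import numpy as np

def td_lambda_episode(env, policy, V, lam, alpha=0.1, gamma=0.99):
    """Run one episode of tabular TD(lambda) with accumulating traces.

    V    : numpy array of state values, indexed by integer state.
    lam  : callable mapping a state index to its trace-decay value,
           so a constant lambda and a state-dependent lambda are
           handled uniformly (hypothetical interface for this sketch).
    env  : assumed to expose reset() -> state and
           step(action) -> (next_state, reward, done).
    """
    z = np.zeros_like(V)                      # eligibility trace vector
    s = env.reset()
    done = False
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)
        # one-step TD error for the current transition
        target = r + (0.0 if done else gamma * V[s_next])
        delta = target - V[s]
        # decay all traces with the (possibly state-dependent) lambda,
        # then bump the trace of the state just visited
        z *= gamma * lam(s)
        z[s] += 1.0
        # credit the TD error to every state in proportion to its trace
        V += alpha * delta * z
        s = s_next
    return V
```

With `lam = lambda s: 0.9` this reduces to ordinary TD(0.9); replacing that constant with a learned, state-dependent function is where a meta-learning scheme of the kind described above would plug in.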