Paper Title
Transfer Reinforcement Learning under Unobserved Contextual Information
Paper Authors
Paper Abstract
In this paper, we study a transfer reinforcement learning problem where the state transitions and rewards are affected by the environmental context. Specifically, we consider a demonstrator agent that has access to a context-aware policy and can generate transition and reward data based on that policy. These data constitute the experience of the demonstrator. The goal is then to transfer this experience, excluding the underlying contextual information, to a learner agent that does not have access to the environmental context, so that it can learn a control policy using fewer samples. It is well known that disregarding the causal effect of the contextual information can introduce bias in the transition and reward models estimated by the learner, resulting in a suboptimal learned policy. To address this challenge, we develop a method to obtain causal bounds on the transition and reward functions using the demonstrator's data, which we then use to obtain causal bounds on the value functions. Using these value function bounds, we propose new Q-learning and UCB-Q-learning algorithms that converge to the true value function without bias. We provide numerical experiments for robot motion planning problems that validate the proposed value function bounds and demonstrate that the proposed algorithms can effectively make use of the demonstrator's data to accelerate the learning process of the learner.
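The abstract only outlines the approach; the following is a minimal sketch, not the paper's algorithm, of how causal value-function bounds can be combined with tabular Q-learning by projecting each update back into the bounded interval. The environment interface (`env.reset`, `env.step`) and the arrays `q_lower`/`q_upper` (standing in for the causal bounds derived from the demonstrator's data) are hypothetical placeholders introduced here for illustration.

```python
import numpy as np

def q_learning_with_bounds(env, q_lower, q_upper,
                           n_episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning that keeps estimates inside causal bounds.

    q_lower, q_upper: arrays of shape (n_states, n_actions) with lower and
    upper bounds on the optimal action-value function (hypothetical inputs).
    """
    n_states, n_actions = q_upper.shape
    # Initialize the estimate inside the bounded interval (here at the upper bound).
    q = q_upper.copy()

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy exploration on the bounded estimate.
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(q[s]))
            s_next, r, done = env.step(a)
            # Standard Q-learning target.
            target = r + gamma * (0.0 if done else np.max(q[s_next]))
            q[s, a] += alpha * (target - q[s, a])
            # Project the update back into the causal interval so the
            # bounds obtained from the demonstrator are never violated.
            q[s, a] = np.clip(q[s, a], q_lower[s, a], q_upper[s, a])
            s = s_next
    return q
```

The clipping step is the part suggested by the abstract: bounds computed from the demonstrator's context-free data restrict the learner's value estimates, which can shrink the effective search space and speed up learning while the iterates still converge toward the true values.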