Paper Title

Provably Sample-Efficient RL with Side Information about Latent Dynamics

Paper Authors

Yao Liu, Dipendra Misra, Miro Dudík, Robert E. Schapire

Abstract

We study reinforcement learning (RL) in settings where observations are high-dimensional, but where an RL agent has access to abstract knowledge about the structure of the state space, as is the case, for example, when a robot is tasked to go to a specific room in a building using observations from its own camera, while having access to the floor plan. We formalize this setting as transfer reinforcement learning from an abstract simulator, which we assume is deterministic (such as a simple model of moving around the floor plan), but which is only required to capture the target domain's latent-state dynamics approximately up to unknown (bounded) perturbations (to account for environment stochasticity). Crucially, we assume no prior knowledge about the structure of observations in the target domain except that they can be used to identify the latent states (but the decoding map is unknown). Under these assumptions, we present an algorithm, called TASID, that learns a robust policy in the target domain, with sample complexity that is polynomial in the horizon, and independent of the number of states, which is not possible without access to some prior knowledge. In synthetic experiments, we verify various properties of our algorithm and show that it empirically outperforms transfer RL algorithms that require access to "full simulators" (i.e., those that also simulate observations).
