Paper Title

Latent Plans for Task-Agnostic Offline Reinforcement Learning

Paper Authors

Erick Rosete-Beas, Oier Mees, Gabriel Kalweit, Joschka Boedecker, Wolfram Burgard

Paper Abstract

Everyday tasks that span long horizons and comprise a sequence of multiple implicit subtasks still pose a major challenge in offline robot control. While a number of prior methods have aimed to address this setting with variants of imitation and offline reinforcement learning, the learned behavior is typically narrow and often struggles to reach configurable long-horizon goals. As the two paradigms have complementary strengths and weaknesses, we propose a novel hierarchical approach that combines the strengths of both to learn task-agnostic long-horizon policies from high-dimensional camera observations. Concretely, we combine a low-level policy that learns latent skills via imitation learning with a high-level policy, learned via offline reinforcement learning, that chains the latent behavior priors into skills. Experiments on various simulated and real robot control tasks show that our formulation produces previously unseen combinations of skills to reach temporally extended goals by "stitching" together latent skills through goal chaining, with an order-of-magnitude improvement in performance over state-of-the-art baselines. We even learn a single multi-task visuomotor policy for 25 distinct manipulation tasks in the real world, which outperforms both imitation learning and offline reinforcement learning techniques.
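
To make the two-level structure concrete, below is a minimal PyTorch sketch of the hierarchy the abstract describes: a low-level policy that decodes latent skills (learned via imitation) into actions, and a high-level policy (learned via offline RL) that emits latent plans to chain those skills toward a long-horizon goal. The class names, network shapes, replanning interval `k`, and `env` interface are all illustrative assumptions, not the paper's implementation.

```python
# A minimal PyTorch sketch of the two-level design described in the
# abstract. Module names, dimensions, the replanning interval k, and
# the env interface are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class LowLevelPolicy(nn.Module):
    """Decodes a latent plan z into an action; in the paper this level
    is trained with imitation learning."""

    def __init__(self, obs_dim: int, plan_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + plan_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, z], dim=-1))


class HighLevelPolicy(nn.Module):
    """Maps the current observation and a long-horizon goal to a latent
    plan; in the paper this level is trained with offline RL so that
    successive plans chain skills toward the goal."""

    def __init__(self, obs_dim: int, goal_dim: int, plan_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, plan_dim),
        )

    def forward(self, obs: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, goal], dim=-1))


@torch.no_grad()
def rollout(env, high: HighLevelPolicy, low: LowLevelPolicy,
            goal: torch.Tensor, horizon: int = 200, k: int = 16):
    """Every k steps the high-level policy picks a fresh latent plan and
    the low-level policy executes it, 'stitching' skills together.
    Assumes env.reset()/env.step() exchange torch tensors."""
    obs = env.reset()
    z = None
    for t in range(horizon):
        if t % k == 0:
            z = high(obs, goal)  # select the next latent skill
        obs, reward, done, info = env.step(low(obs, z))
        if done:
            break
```

Replanning every `k` steps is one simple way to realize the skill-chaining the abstract mentions: each high-level decision commits the low-level policy to one latent behavior for a short window, and the sequence of windows composes previously unseen skill combinations.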
