Paper Title

Learning of Long-Horizon Sparse-Reward Robotic Manipulator Tasks with Base Controllers

Authors

Guangming Wang, Minjian Xin, Wenhua Wu, Zhe Liu, Hesheng Wang

Abstract

Deep Reinforcement Learning (DRL) enables robots to perform some intelligent tasks end-to-end. However, many challenges remain for long-horizon sparse-reward robotic manipulator tasks. On the one hand, the sparse-reward setting makes exploration inefficient. On the other hand, exploration with physical robots is costly and unsafe. In this paper, we propose a method for learning long-horizon sparse-reward tasks that utilizes one or more existing traditional controllers, referred to in this paper as base controllers. Built upon Deep Deterministic Policy Gradient (DDPG), our algorithm incorporates the existing base controllers into the stages of exploration, value learning, and policy update. Furthermore, we present a straightforward way of synthesizing different base controllers to integrate their strengths. Through experiments ranging from stacking blocks to stacking cups, we demonstrate that the learned state-based or image-based policies steadily outperform the base controllers. Compared to previous works on learning from demonstrations, our method improves sample efficiency by orders of magnitude and achieves better performance. Overall, our method has the potential to leverage existing industrial robot manipulation systems to build more flexible and intelligent controllers.
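
The abstract states that base controllers are folded into DDPG's exploration stage (among others). Below is a minimal illustrative sketch, not the authors' actual implementation: the `BaseController` and `DDPGAgent` classes and the fixed mixing probability `p_base` are assumptions made for illustration, showing one way exploration actions could be drawn from either a hand-engineered controller or the learned policy so that both kinds of transitions feed the same replay buffer.

```python
# Illustrative sketch only (assumed names and interfaces, not the paper's code).
import random


class BaseController:
    """Hypothetical hand-engineered controller, e.g., a scripted pick-and-place policy."""

    def act(self, obs):
        # Placeholder: return an action computed by traditional control logic.
        return [0.0, 0.0, 0.0, 0.0]


class DDPGAgent:
    """Hypothetical DDPG learner whose act() adds Gaussian exploration noise."""

    def act(self, obs, noise_scale=0.1):
        # Placeholder: noisy action from the current deterministic policy.
        return [random.gauss(0.0, noise_scale) for _ in range(4)]


def explore_action(agent, base_controller, obs, p_base=0.3):
    """With probability p_base follow the base controller, otherwise the learned policy.

    Transitions collected either way are stored in the same replay buffer, so the
    critic can also learn from the base controller's (often successful) trajectories.
    """
    if random.random() < p_base:
        return base_controller.act(obs)
    return agent.act(obs)
```
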
