Goal-Aware Generative Adversarial Imitation Learning from Imperfect Demonstration for Robotic Cloth Manipulation

Paper title

Goal-Aware Generative Adversarial Imitation Learning from Imperfect Demonstration for Robotic Cloth Manipulation

Authors

Yoshihisa Tsurumine, Takamitsu Matsubara

Abstract

Generative Adversarial Imitation Learning (GAIL) can learn policies from demonstrations without explicitly defining a reward function. GAIL has the potential to learn policies that take high-dimensional observations, e.g., images, as input. By applying GAIL to a real robot, robot policies could potentially be obtained for daily activities like washing, folding clothes, cooking, and cleaning. However, human demonstration data are often imperfect due to mistakes, which degrade the performance of the resulting policies. We address this issue by focusing on the following features: 1) many robotic tasks are goal-reaching tasks, and 2) labeling such goal states in demonstration data is relatively easy. With these in mind, this paper proposes Goal-Aware Generative Adversarial Imitation Learning (GA-GAIL), which trains a policy by introducing a second discriminator that distinguishes the goal state, in parallel with the first discriminator that distinguishes the demonstration data. This extends the standard GAIL framework to learn desirable policies more robustly, even from imperfect demonstrations, through a goal-state discriminator that promotes achieving the goal state. Furthermore, GA-GAIL employs the Entropy-maximizing Deep P-Network (EDPN) as a generator, which considers both smoothness and causal entropy in the policy update, to achieve stable policy learning from two discriminators. Our proposed method was successfully applied to two real-robot cloth-manipulation tasks: turning a handkerchief over and folding clothes. We confirmed that it learns cloth-manipulation policies without task-specific reward function design. Videos of the real experiments are available at https://youtu.be/h_nII2ooUrE.
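
The abstract does not state how the outputs of the two discriminators are turned into a learning signal for the policy. As a rough illustration only, the sketch below assumes each discriminator outputs a probability in (0, 1), where a higher value means "looks like demonstration data" or "looks like a goal state", and that the two surrogate rewards are combined by a weighted sum. The function names, the `goal_weight` parameter, and the weighted-sum form are all hypothetical and are not taken from the paper.

```python
import math


def surrogate_reward(d_prob, eps=1e-8):
    """GAIL-style surrogate reward -log(1 - D) for a discriminator output D.

    Assumption: D is the probability that the input is "expert-like"
    (or "goal-like"), so a larger D yields a larger reward.
    """
    d = min(max(d_prob, eps), 1.0 - eps)  # clip to avoid log(0)
    return -math.log(1.0 - d)


def combined_reward(d_demo, d_goal, goal_weight=1.0):
    """Hypothetical combination of the two discriminator signals.

    d_demo: output of the demonstration discriminator for a state-action pair.
    d_goal: output of the goal-state discriminator for the resulting state.
    goal_weight: assumed trade-off coefficient (not from the paper).
    """
    return surrogate_reward(d_demo) + goal_weight * surrogate_reward(d_goal)


if __name__ == "__main__":
    # Same imitation score, but the second sample looks closer to a goal state.
    far_from_goal = combined_reward(d_demo=0.5, d_goal=0.1)
    near_goal = combined_reward(d_demo=0.5, d_goal=0.9)
    print(far_from_goal < near_goal)
```

Under these assumptions, a transition that imitates an imperfect demonstration equally well still receives a higher reward when it moves toward a labeled goal state, which is the intuition the abstract gives for adding the second discriminator.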

Paper title