Paper Title
Human-guided Robot Behavior Learning: A GAN-assisted Preference-based Reinforcement Learning Approach
Paper Authors
Paper Abstract
Human demonstrations can provide trustworthy samples for training reinforcement learning algorithms so that robots can learn complex behaviors in real-world environments. However, obtaining sufficient demonstrations may be impractical because many behaviors are difficult for humans to demonstrate. A more practical approach is to replace human demonstrations with human queries, i.e., preference-based reinforcement learning. One key limitation of existing algorithms is the need for a significant number of human queries, because a large amount of labeled data is required to train the neural networks that approximate a continuous, high-dimensional reward function. To minimize the need for human queries, we propose a new GAN-assisted human preference-based reinforcement learning approach that uses a generative adversarial network (GAN) to actively learn human preferences and then take over the human's role in assigning preferences. The adversarial neural network is simple and has only a binary output, and hence requires far fewer human queries to train. Moreover, a maximum-entropy-based reinforcement learning algorithm is designed to shape the loss toward desired regions and away from undesired regions. To show the effectiveness of the proposed approach, we present studies on complex robotic tasks, without access to the environment reward, in a typical MuJoCo robot locomotion environment. The results show that our method can reduce human time by about 99.8% without sacrificing performance.
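To make the core idea concrete, below is a minimal sketch (in PyTorch; not the authors' implementation) of the binary-output preference discriminator described in the abstract: it is first fit on a small set of human preference labels over trajectory-segment pairs, and afterwards stands in for the human when labeling new pairs. All names and sizes here (SEG_LEN, OBS_DIM, PreferenceDiscriminator, label_pair) are illustrative assumptions.

import torch
import torch.nn as nn

SEG_LEN, OBS_DIM = 50, 17  # assumed segment length and observation size

class PreferenceDiscriminator(nn.Module):
    # Binary classifier: which of two trajectory segments is preferred?
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * SEG_LEN * OBS_DIM, 256),
            nn.ReLU(),
            nn.Linear(256, 1),  # a single logit: the binary output
        )

    def forward(self, seg_a, seg_b):
        # Flatten each (batch, SEG_LEN, OBS_DIM) segment and concatenate the pair.
        x = torch.cat([seg_a.flatten(1), seg_b.flatten(1)], dim=1)
        return self.net(x).squeeze(-1)  # logit that seg_a is preferred

def train_on_human_queries(disc, pairs, human_labels, epochs=10):
    # Fit the discriminator on the few available human preference labels.
    opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for (seg_a, seg_b), y in zip(pairs, human_labels):
            opt.zero_grad()
            loss = loss_fn(disc(seg_a, seg_b), y)
            loss.backward()
            opt.step()

@torch.no_grad()
def label_pair(disc, seg_a, seg_b):
    # Once trained, the discriminator replaces the human annotator:
    # returns 1.0 if seg_a is judged preferable to seg_b, else 0.0.
    return (torch.sigmoid(disc(seg_a, seg_b)) > 0.5).float()

The maximum-entropy component is not specified in detail in the abstract; the standard maximum-entropy RL objective, J(pi) = sum_t E[r(s_t, a_t) + alpha * H(pi(.|s_t))], with the learned preference signal supplying r, would be the usual starting point for such loss shaping, but the exact shaping used in the paper is not given here.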