Paper Title

Reward Engineering for Object Pick and Place Training

Paper Authors

Raghav Nagpal, Achyuthan Unni Krishnan, Hanshen Yu

Paper Abstract

Robotic grasping is a crucial area of research, as it can accelerate the automation of several industries that use robots, ranging from manufacturing to healthcare. Reinforcement learning is the field of study in which an agent learns a policy for selecting actions by exploring an environment and exploiting its rewards. An agent can thus use reinforcement learning to learn how to perform a certain task, in our case grasping an object. We used the Pick and Place environment provided by OpenAI's Gym to engineer rewards. Hindsight Experience Replay (HER) has shown promising results on problems with sparse rewards. In the default configuration of the OpenAI baselines and environment, the reward function is calculated using the distance between the target location and the robot end-effector. By weighting the cost based on the distance of the end-effector from the goal along the x, y, and z axes, an intuitive strategy, we were able to almost halve the learning time compared to the baselines provided by OpenAI. In this project, we were also able to introduce certain user-desired trajectories into the learnt policies (city-block / Manhattan trajectories). This helps us understand that by engineering the rewards we can tune the agent to learn a policy in a particular, desired way, even if that way is not the most optimal one.
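For illustration, the sketch below shows one way an axis-weighted distance reward and a city-block (Manhattan) distance reward could be computed for 3-D goal positions. The function names, weight values, and positions are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Illustrative sketch only: axis-weighted and city-block (Manhattan) distance
# rewards for a pick-and-place style task. Weights, names, and positions are
# assumptions for demonstration, not the paper's exact code.

def weighted_distance_reward(achieved_goal, desired_goal,
                             weights=(1.0, 1.0, 2.0)):
    """Negative Euclidean distance with each axis (x, y, z) weighted separately.

    A larger weight on an axis penalizes error along that axis more heavily,
    e.g. emphasizing height (z) while lifting the object.
    """
    diff = np.asarray(achieved_goal) - np.asarray(desired_goal)
    return -np.linalg.norm(np.asarray(weights) * diff)

def manhattan_reward(achieved_goal, desired_goal):
    """Negative city-block (L1) distance between the end-effector and the goal.

    Penalizing the L1 distance encourages axis-aligned, "city-block" style
    trajectories toward the goal.
    """
    diff = np.asarray(achieved_goal) - np.asarray(desired_goal)
    return -np.sum(np.abs(diff))

# Example usage with dummy 3-D positions:
if __name__ == "__main__":
    ee_pos = [1.2, 0.7, 0.45]   # end-effector / object position
    goal = [1.3, 0.9, 0.60]     # target location
    print(weighted_distance_reward(ee_pos, goal))
    print(manhattan_reward(ee_pos, goal))
```

In Gym's goal-based Fetch environments, a dense shaped reward of this form would typically be supplied through the environment's compute_reward hook; the exact integration used in the paper is not specified here and is an assumption.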
