Paper Title
Weak Human Preference Supervision For Deep Reinforcement Learning
Paper Authors
Paper Abstract
Current reward learning from human preferences can be used to solve complex reinforcement learning (RL) tasks without access to a reward function by defining a single fixed preference between pairs of trajectory segments. However, the judgement of preferences between trajectories is not dynamic and still requires human input over thousands of iterations. In this study, we propose a weak human preference supervision framework, for which we develop a human preference scaling model that naturally reflects human perception of the degree of weak choices between trajectories, and a human-demonstration estimator trained via supervised learning to generate predicted preferences and thereby reduce the number of required human inputs. The proposed weak human preference supervision framework can effectively solve complex RL tasks and achieves higher cumulative rewards in simulated robot locomotion -- MuJoCo games -- relative to single fixed human preferences. Furthermore, our human-demonstration estimator requires human feedback for less than 0.01\% of the agent's interactions with the environment and reduces the cost of human input by up to 30\% compared with existing approaches. To demonstrate the flexibility of our approach, we released a video (https://youtu.be/jQPe1OILT0M) showing comparisons of the behaviours of agents trained on different types of human input. We believe that our naturally inspired human preferences with weakly supervised learning are beneficial for precise reward learning and can be applied to state-of-the-art RL systems, such as human-autonomy teaming systems.
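For context, the sketch below illustrates the standard preference-based reward-learning loss that this line of work builds on (Christiano et al., 2017), with a soft label mu in [0, 1] standing in for the paper's weak preference scaling rather than a hard binary choice. All names, network sizes, and shapes are illustrative assumptions, not the paper's exact model.

```python
# Minimal sketch (assumption, not the paper's exact implementation):
# Bradley-Terry preference-based reward learning with a soft "weak" label.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Predicts a scalar reward for each (state, action) feature vector."""
    def __init__(self, obs_act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (T, obs_act_dim) -> summed predicted reward over the segment
        return self.net(segment).sum()

def weak_preference_loss(model: RewardModel,
                         seg_a: torch.Tensor,
                         seg_b: torch.Tensor,
                         mu: float) -> torch.Tensor:
    """Cross-entropy between the Bradley-Terry probability
    P[A > B] = exp(R_A) / (exp(R_A) + exp(R_B)) and a soft label mu,
    where mu = 1 means A is clearly preferred and mu = 0.5 means indifference."""
    r_a, r_b = model(seg_a), model(seg_b)
    p_a = torch.sigmoid(r_a - r_b)  # Bradley-Terry preference probability
    mu_t = torch.tensor(mu)
    return -(mu_t * torch.log(p_a + 1e-8)
             + (1 - mu_t) * torch.log(1 - p_a + 1e-8))

# Usage: two 50-step segments of 20-dim observation+action features,
# with a weak preference of 0.7 towards segment A.
model = RewardModel(obs_act_dim=20)
seg_a, seg_b = torch.randn(50, 20), torch.randn(50, 20)
loss = weak_preference_loss(model, seg_a, seg_b, mu=0.7)
loss.backward()
```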