Paper Title
Combining imitation and deep reinforcement learning to accomplish human-level performance on a virtual foraging task
Paper Authors
Paper Abstract
We develop a simple framework to learn bio-inspired foraging policies from human data. We conduct an experiment in which humans are virtually immersed in an open-field foraging environment and are trained to collect as many rewards as possible. A Markov Decision Process (MDP) framework is introduced to model human decision dynamics. Imitation Learning (IL) based on maximum-likelihood estimation is then used to train a Neural Network (NN) that maps observed states to human decisions. The results show that passive imitation substantially underperforms humans. We therefore refine the human-inspired policies via Reinforcement Learning (RL) using the on-policy Proximal Policy Optimization (PPO) algorithm, which exhibits better stability than alternative algorithms and steadily improves the policies pretrained with IL. We show that the combination of IL and RL matches human performance, and that good performance strongly depends on combining allocentric information with an egocentric representation of the environment.
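The abstract describes a two-stage pipeline: maximum-likelihood imitation learning on human (state, action) pairs, followed by on-policy PPO refinement. Below is a minimal sketch of that pipeline, assuming a discrete action space, a flat state vector, and a PyTorch implementation; `PolicyNet`, `imitation_step`, `ppo_clipped_loss`, and all shapes and hyperparameters are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the IL -> RL pipeline outlined in the abstract.
# All module names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyNet(nn.Module):
    """Maps an observed state vector to logits over discrete actions."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.head(self.body(state))


def imitation_step(policy, optimizer, states, human_actions):
    """One maximum-likelihood (behavioral-cloning) update:
    minimize the negative log-likelihood of the humans' chosen actions."""
    logits = policy(states)
    loss = F.cross_entropy(logits, human_actions)  # NLL under softmax policy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def ppo_clipped_loss(policy, states, actions, old_log_probs, advantages,
                     clip_eps: float = 0.2):
    """PPO's clipped surrogate objective, used to refine the IL-pretrained
    policy on-policy while limiting how far each update can move it."""
    log_probs = torch.distributions.Categorical(
        logits=policy(states)).log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

In this sketch, the weights learned by repeated calls to `imitation_step` initialize the policy that PPO then fine-tunes; `old_log_probs` and `advantages` would come from on-policy rollouts in the foraging environment, which is what lets RL push the policy past the passive-imitation baseline toward human-level reward.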