Paper Title
Guided Policy Search Model-based Reinforcement Learning for Urban Autonomous Driving
Paper Authors
Paper Abstract
In this paper, we continue our prior work on using imitation learning (IL) and model-free reinforcement learning (RL) to learn driving policies for urban autonomous driving, by introducing a model-based RL method to drive an autonomous vehicle in the Carla urban driving simulator. Although IL and model-free RL methods have proven capable of solving many challenging tasks, including playing video games, robotic control, and, in our prior work, urban driving, their low sample efficiency greatly limits their application to real-world autonomous driving. In this work, we develop a model-based RL algorithm, guided policy search (GPS), for urban driving tasks. The algorithm iteratively learns a parameterized dynamics model that approximates the complex, interactive driving task, and optimizes the driving policy under this nonlinear approximate dynamics model. As a model-based RL approach, GPS offers higher sample efficiency, better interpretability, and greater stability when applied to urban autonomous driving. We provide extensive experiments validating the effectiveness of the proposed method in learning robust driving policies for urban driving in Carla. We also compare the proposed method with other policy search and model-free RL baselines, showing that the GPS-based RL method achieves roughly 100x better sample efficiency and can learn policies for harder tasks that the baseline methods can hardly learn.
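
The iterative model-fitting and policy-improvement loop described in the abstract can be illustrated with a minimal sketch. The Python code below is not the paper's implementation: the state/action dimensions, the toy linear environment standing in for the Carla simulator, and the quadratic cost are all illustrative assumptions. It only shows the GPS-style cycle of sampling rollouts with the current controller, fitting time-varying linear dynamics by least squares, and improving a time-varying linear feedback controller with an LQR-style backward pass under the fitted model.

import numpy as np

T, DX, DU = 20, 4, 2          # horizon, state dim, action dim (assumed for illustration)
rng = np.random.default_rng(0)

def rollout(K, k, true_step, n=20, noise=0.1):
    """Collect n trajectories with the current linear controller u = K_t x + k_t."""
    X = np.zeros((n, T + 1, DX)); U = np.zeros((n, T, DU))
    for i in range(n):
        x = rng.normal(size=DX)
        for t in range(T):
            u = K[t] @ x + k[t] + noise * rng.normal(size=DU)
            X[i, t], U[i, t] = x, u
            x = true_step(x, u)
        X[i, T] = x
    return X, U

def fit_dynamics(X, U):
    """Least-squares fit of x_{t+1} ≈ F_t [x_t; u_t] + f_t at each time step."""
    F = np.zeros((T, DX, DX + DU)); f = np.zeros((T, DX))
    for t in range(T):
        XU = np.hstack([X[:, t], U[:, t], np.ones((X.shape[0], 1))])
        W, *_ = np.linalg.lstsq(XU, X[:, t + 1], rcond=None)
        F[t], f[t] = W[:-1].T, W[-1]
    return F, f

def lqr_backward(F, f, Q, R):
    """Backward pass under the fitted model; returns improved feedback gains."""
    K = np.zeros((T, DU, DX)); k = np.zeros((T, DU))
    V, v = np.zeros((DX, DX)), np.zeros(DX)
    for t in reversed(range(T)):
        Fx, Fu = F[t][:, :DX], F[t][:, DX:]
        Qxx = Q + Fx.T @ V @ Fx
        Quu = R + Fu.T @ V @ Fu
        Qux = Fu.T @ V @ Fx
        qu = Fu.T @ (V @ f[t] + v)
        qx = Fx.T @ (V @ f[t] + v)
        K[t] = -np.linalg.solve(Quu, Qux)
        k[t] = -np.linalg.solve(Quu, qu)
        V = Qxx + Qux.T @ K[t]
        v = qx + Qux.T @ k[t]
    return K, k

# Toy "true" environment standing in for the driving simulator (assumption).
A_true = np.eye(DX) + 0.05 * rng.normal(size=(DX, DX))
B_true = 0.1 * rng.normal(size=(DX, DU))
true_step = lambda x, u: A_true @ x + B_true @ u

Q, R = np.eye(DX), 0.1 * np.eye(DU)          # quadratic state/action costs (assumed)
K = np.zeros((T, DU, DX)); k = np.zeros((T, DU))
for it in range(5):                           # GPS-style outer loop
    X, U = rollout(K, k, true_step)           # 1. sample with current policy
    F, f = fit_dynamics(X, U)                 # 2. refit local dynamics model
    K, k = lqr_backward(F, f, Q, R)           # 3. improve policy under the model
    cost = np.mean(np.einsum('nti,ij,ntj->nt', X[:, :T], Q, X[:, :T]))
    print(f"iter {it}: mean state cost {cost:.3f}")

In the full GPS algorithm described in the paper, the locally improved controllers would additionally supervise a global parameterized policy, and the dynamics model would be nonlinear; this sketch only conveys the alternation between model fitting and policy optimization that underlies the reported sample-efficiency advantage.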