Paper Title
Safe Reinforcement Learning for Autonomous Vehicles through Parallel Constrained Policy Optimization
Paper Authors
Paper Abstract
Reinforcement learning (RL) is attracting increasing interest in autonomous driving due to its potential to solve complex classification and control problems. However, existing RL algorithms are rarely applied to real vehicles because of two predominant problems: their behaviours are unexplainable, and they cannot guarantee safety in new scenarios. This paper presents a safe RL algorithm, called Parallel Constrained Policy Optimization (PCPO), for two autonomous driving tasks. PCPO extends today's common actor-critic architecture to a three-component learning framework, in which three neural networks are used to approximate the policy function, the value function, and a newly added risk function, respectively. Meanwhile, a trust region constraint is added to allow large update steps without breaking the monotonic improvement condition. To ensure the feasibility of the safety-constrained problem, synchronized parallel learners are employed to explore different state spaces, which accelerates learning and policy updates. Simulations of two scenarios for autonomous vehicles confirm that safety can be ensured while achieving fast learning.
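The abstract describes extending the actor-critic architecture with a third, risk network. The sketch below is a minimal illustration of how such a three-component framework might be organized in PyTorch; it is not the authors' implementation, and the class name, network sizes, and Gaussian-policy assumption are illustrative choices of this sketch only.

```python
# Minimal sketch (assumption, not the paper's code) of a three-component
# actor-critic-risk framework: one network each for the policy, the value
# function, and the newly added risk function.
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=64):
    # Small two-layer MLP used for all three function approximators.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, out_dim),
    )


class ActorCriticRisk(nn.Module):
    # Hypothetical container for the three networks described in the abstract.
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.policy = mlp(state_dim, action_dim)  # approximates the policy (action mean)
        self.value = mlp(state_dim, 1)            # approximates the reward value function
        self.risk = mlp(state_dim, 1)             # approximates the risk (safety cost) function
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # Gaussian exploration noise

    def forward(self, state):
        # Returns the action mean plus the value and risk estimates for the state.
        return self.policy(state), self.value(state), self.risk(state)


# Usage with a hypothetical 8-dimensional state and 2-dimensional action:
net = ActorCriticRisk(state_dim=8, action_dim=2)
s = torch.randn(1, 8)
action_mean, value, risk = net(s)
print(action_mean.shape, value.item(), risk.item())
```

In a PCPO-style update, the value estimate would drive the reward objective while the risk estimate would supply the safety constraint, with the policy update restricted to a trust region; the parallel learners would each hold such a model and synchronize their updates.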