Title
Reinforcement Learning from Demonstrations by Novel Interactive Expert and Application to Automatic Berthing Control Systems for Unmanned Surface Vessel
Authors
Abstract
In this paper, two novel practical methods of Reinforcement Learning from Demonstration (RLfD) are developed and applied to automatic berthing control systems for Unmanned Surface Vessels. A new expert data generation method, called the Model Predictive Based Expert (MPBE), which combines Model Predictive Control and Deep Deterministic Policy Gradient, is developed to provide high-quality supervision data for RLfD algorithms. A straightforward RLfD method, Model Predictive Deep Deterministic Policy Gradient (MP-DDPG), is first introduced by replacing the RL agent with the MPBE to interact directly with the environment. The distribution mismatch problem of MP-DDPG is then analyzed, and two techniques that alleviate it are proposed. Furthermore, another novel RLfD algorithm based on MP-DDPG, called Self-Guided Actor-Critic (SGAC), is presented; it can effectively leverage the MPBE by continuously querying it to generate high-quality expert data online. SGAC addresses the distribution mismatch problem, which otherwise leads to an unstable learning process, in a DAgger-like manner. In addition, a theoretical analysis is given to prove that the SGAC algorithm converges with guaranteed monotonic improvement. Simulation results verify the effectiveness of MP-DDPG and SGAC in accomplishing the ship berthing control task, and show the advantages of SGAC compared with other typical reinforcement learning algorithms and with MP-DDPG.
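To make the "DAgger-like" mechanism concrete: in DAgger-style learning from demonstration, the *learner's own* rollouts decide which states are visited, and the expert is queried online to label exactly those states, so the training distribution tracks the learner and the distribution mismatch of pure behavioral cloning is avoided. The toy sketch below illustrates this loop in miniature; it is not the paper's SGAC algorithm, and all names (`expert_action`, `toy_env_step`, `fit`) are hypothetical stand-ins for the MPBE expert, the vessel dynamics, and the policy update.

```python
# Illustrative DAgger-style loop (hypothetical toy example, not the paper's SGAC):
# the learner rolls out, every visited state is labeled by the expert, and the
# policy is refit on the aggregated dataset each iteration.
import random

def expert_action(state):
    # Toy "expert": drive the state toward zero with a fixed gain of -0.5.
    return -0.5 * state

def toy_env_step(state, action):
    # Trivial 1-D dynamics standing in for the vessel model.
    return state + action

def fit(dataset):
    # Toy policy class: linear policy a = k * s, fit by least squares.
    num = sum(s * a for s, a in dataset)
    den = sum(s * s for s, _ in dataset) or 1.0
    k = num / den
    return lambda state: k * state

def dagger(iterations=5, horizon=10, seed=0):
    rng = random.Random(seed)
    dataset = []
    policy = lambda s: 0.0          # untrained initial policy
    for _ in range(iterations):
        state = rng.uniform(-1.0, 1.0)
        for _ in range(horizon):
            # Key DAgger step: states come from the *learner's* rollout,
            # labels come from the online-queried expert.
            dataset.append((state, expert_action(state)))
            state = toy_env_step(state, policy(state))
        policy = fit(dataset)       # retrain on the aggregated dataset
    return policy

policy = dagger()
```

Because the states in the dataset are always drawn from the current learner's trajectories, the mismatch between training and execution distributions shrinks as the policy improves; here the linear learner recovers the expert's gain exactly.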