Paper Title
Planning on the fast lane: Learning to interact using attention mechanisms in path integral inverse reinforcement learning
Paper Authors
Paper Abstract
General-purpose trajectory planning algorithms for automated driving utilize complex reward functions to perform a combined optimization of strategic, behavioral, and kinematic features. The specification and tuning of a single reward function is a tedious task and does not generalize over a large set of traffic situations. Deep learning approaches based on path integral inverse reinforcement learning have been successfully applied to predict local situation-dependent reward functions using features of a set of sampled driving policies. Sample-based trajectory planning algorithms are able to approximate a spatio-temporal subspace of feasible driving policies that can be used to encode the context of a situation. However, the interaction with dynamic objects requires an extended planning horizon, which depends on sequential context modeling. In this work, we are concerned with the sequential reward prediction over an extended time horizon. We present a neural network architecture that uses a policy attention mechanism to generate a low-dimensional context vector by concentrating on trajectories with a human-like driving style. Apart from this, we propose a temporal attention mechanism to identify context switches and allow for stable adaptation of rewards. We evaluate our results on complex simulated driving situations, including other moving vehicles. Our evaluation shows that our policy attention mechanism learns to focus on collision-free policies in the configuration space. Furthermore, the temporal attention mechanism learns persistent interaction with other vehicles over an extended planning horizon.
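The abstract gives no implementation details, so the following is only a minimal sketch of how a policy attention mechanism of the kind described might look: softmax-weighted pooling over the features of a set of sampled driving policies, projected down to a low-dimensional context vector. PyTorch, the module name `PolicyAttention`, and the dimensions `feature_dim` and `context_dim` are all illustrative assumptions, not the authors' architecture.

```python
# Hypothetical sketch of a policy attention mechanism as described in the
# abstract: attend over N sampled policies and emit a low-dimensional
# context vector. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class PolicyAttention(nn.Module):
    def __init__(self, feature_dim: int, context_dim: int):
        super().__init__()
        # One scalar attention score per sampled policy.
        self.score = nn.Linear(feature_dim, 1)
        # Projection of the pooled features to a low-dimensional context.
        self.project = nn.Linear(feature_dim, context_dim)

    def forward(self, policy_features: torch.Tensor) -> torch.Tensor:
        # policy_features: (batch, num_policies, feature_dim)
        weights = torch.softmax(self.score(policy_features), dim=1)  # (batch, N, 1)
        pooled = (weights * policy_features).sum(dim=1)              # (batch, feature_dim)
        return self.project(pooled)                                  # (batch, context_dim)

# Usage: attend over 64 sampled policies, each described by 12 features.
attn = PolicyAttention(feature_dim=12, context_dim=8)
context = attn(torch.randn(2, 64, 12))  # -> shape (2, 8)
```

In this reading, the softmax weights would let the network concentrate on the collision-free, human-like policies mentioned in the abstract; the temporal attention mechanism could then be applied analogously over a sequence of such context vectors to detect context switches across planning cycles.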