TAMFormer: Multi-Modal Transformer with Learned Attention Mask for Early Intent Prediction

Paper Title

TAMFormer: Multi-Modal Transformer with Learned Attention Mask for Early Intent Prediction

Authors

Nada Osman, Guglielmo Camporese, Lamberto Ballan

Abstract

Human intention prediction is a growing area of research in which a vision-based system has to anticipate an activity in a video. To this end, the model builds a representation of the past and then produces future hypotheses about upcoming scenarios. In this work, we focus on early intention prediction for pedestrians: from a current observation of an urban scene, the model predicts the future activity of pedestrians who approach the street. Our method is based on a multi-modal transformer that encodes past observations and produces multiple predictions at different anticipation times. Moreover, we propose to learn the attention masks of our transformer-based model (Temporal Adaptive Mask Transformer) in order to weight present and past temporal dependencies differently. We evaluate our method on several public benchmarks for early intention prediction, improving prediction performance at different anticipation times compared to previous work.
