Paper Title

Grounding Commands for Autonomous Vehicles via Layer Fusion with Region-specific Dynamic Layer Attention

Authors

Hou Pong Chan, Mingxi Guo, Cheng-Zhong Xu

Abstract

Grounding a command to the visual environment is an essential ingredient for interactions between autonomous vehicles and humans. In this work, we study the problem of language grounding for autonomous vehicles, which aims to localize a region in a visual scene according to a natural language command from a passenger. Prior work only employs the top layer representations of a vision-and-language pre-trained model to predict the region referred to by the command. However, such a method omits the useful features encoded in other layers, and thus results in inadequate understanding of the input scene and command. To tackle this limitation, we present the first layer fusion approach for this task. Since different visual regions may require distinct types of features to disambiguate them from each other, we further propose the region-specific dynamic (RSD) layer attention to adaptively fuse the multimodal information across layers for each region. Extensive experiments on the Talk2Car benchmark demonstrate that our approach helps predict more accurate regions and outperforms state-of-the-art methods.
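To make the fusion idea concrete, below is a minimal NumPy sketch of region-specific layer attention as described in the abstract: each candidate region gets its own softmax distribution over encoder layers, which is then used to blend that region's per-layer features. This is an illustrative sketch only, not the authors' implementation; the function name `rsd_layer_fusion` and the single scoring vector `w_score` (a learned parameter in the real model) are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rsd_layer_fusion(layer_feats, w_score):
    """Fuse per-layer region features with region-specific layer attention.

    layer_feats: (L, R, D) array -- the representation of each of R regions
                 at each of L encoder layers.
    w_score:     (D,) scoring vector (hypothetical stand-in for a learned
                 parameter).
    Returns:     (R, D) fused representation, one vector per region.
    """
    scores = layer_feats @ w_score                       # (L, R) layer scores
    attn = softmax(scores, axis=0)                       # per-region weights over layers
    fused = (attn[..., None] * layer_feats).sum(axis=0)  # (R, D) weighted sum
    return fused
```

Because the softmax is taken over the layer axis independently for each region, two regions can draw on different layers (e.g. low-level appearance features for one, high-level semantic features for another), which is the "region-specific" part of the mechanism.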
