Paper Title
Structure-Encoding Auxiliary Tasks for Improved Visual Representation in Vision-and-Language Navigation
Paper Authors
Paper Abstract
In Vision-and-Language Navigation (VLN), researchers typically take an image encoder pre-trained on ImageNet without fine-tuning on the environments that the agent will be trained or tested on. However, the distribution shift between the training images from ImageNet and the views in the navigation environments may render the ImageNet pre-trained image encoder suboptimal. Therefore, in this paper, we design a set of structure-encoding auxiliary tasks (SEA) that leverage the data in the navigation environments to pre-train and improve the image encoder. Specifically, we design and customize (1) 3D jigsaw, (2) traversability prediction, and (3) instance classification to pre-train the image encoder. Through rigorous ablations, our SEA pre-trained features are shown to better encode structural information of the scenes, which ImageNet pre-trained features fail to properly encode but which is crucial for the target navigation task. The SEA pre-trained features can be easily plugged into existing VLN agents without any tuning. For example, on Test-Unseen environments, the VLN agents combined with our SEA pre-trained features achieve absolute success rate improvements of 12% for Speaker-Follower, 5% for Env-Dropout, and 4% for AuxRN.
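As an illustration of how a jigsaw-style pretext task is typically set up (the paper's exact 3D-jigsaw formulation is not specified in this abstract, so the patch count and helper names below are assumptions), a minimal sketch: split an observation into view patches, shuffle them with one permutation from a fixed set, and train an auxiliary head on the encoder to classify which permutation was applied.

```python
import itertools
import random

# Hypothetical sketch of jigsaw label generation for self-supervised
# pre-training. K and the function names are illustrative assumptions,
# not the paper's actual implementation.
K = 4  # number of view patches per observation (assumed)

# Fixed permutation set; each permutation index is one class label
# for the auxiliary classification head (4! = 24 classes for K = 4).
PERMUTATIONS = list(itertools.permutations(range(K)))

def make_jigsaw_example(patches, rng=random):
    """Shuffle `patches` with a random permutation from the fixed set.

    Returns (shuffled_patches, label), where `label` indexes the
    permutation that was applied; the encoder's auxiliary head is
    trained to predict `label` from the shuffled patches.
    """
    label = rng.randrange(len(PERMUTATIONS))
    perm = PERMUTATIONS[label]
    shuffled = [patches[i] for i in perm]
    return shuffled, label

# Usage: in practice `patches` would be image crops; strings stand in here.
patches = ["view0", "view1", "view2", "view3"]
shuffled, label = make_jigsaw_example(patches)
# The original ordering is recoverable from the predicted permutation:
recovered = [None] * K
for pos, src in enumerate(PERMUTATIONS[label]):
    recovered[src] = shuffled[pos]
```

The traversability-prediction and instance-classification tasks would attach analogous auxiliary heads, with labels derived from the navigation graph and from object instances in the environment, respectively.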