Paper Title

CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations

Paper Authors

Jialu Li, Hao Tan, Mohit Bansal

Paper Abstract

Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions. In this paper, we aim to solve two key challenges in this task: utilizing multilingual instructions for improved instruction-path grounding and navigating through new environments that are unseen during training. To address these challenges, we propose CLEAR: Cross-Lingual and Environment-Agnostic Representations. First, our agent learns a shared and visually-aligned cross-lingual language representation for the three languages (English, Hindi and Telugu) in the Room-Across-Room dataset. Our language representation learning is guided by text pairs that are aligned by visual information. Second, our agent learns an environment-agnostic visual representation by maximizing the similarity between semantically-aligned image pairs (with constraints on object-matching) from different environments. Our environment-agnostic visual representation can mitigate the environment bias induced by low-level visual information. Empirically, on the Room-Across-Room dataset, we show that our multilingual agent gets large improvements in all metrics over the strong baseline model when generalizing to unseen environments with the cross-lingual language representation and the environment-agnostic visual representation. Furthermore, we show that our learned language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation tasks, and present detailed qualitative and quantitative generalization and grounding analysis. Our code is available at https://github.com/jialuli-luka/CLEAR.
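Both representations in the abstract are described as being learned by maximizing the similarity between aligned pairs: visually-aligned text pairs across languages, and semantically-aligned image pairs across environments. As a rough illustration only, the sketch below applies an InfoNCE-style contrastive loss to hypothetical paired embeddings; it is not the paper's actual objective, and all tensor names, batch sizes, and dimensions are invented for the example (see the linked repository for the real implementation).

```python
import torch
import torch.nn.functional as F

def alignment_loss(anchor, positive, temperature=0.1):
    """InfoNCE-style loss: pull each anchor toward its aligned positive
    and push it away from the other pairs in the batch."""
    anchor = F.normalize(anchor, dim=-1)          # unit-normalize embeddings
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature  # pairwise cosine similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)  # diagonal = aligned pairs
    return F.cross_entropy(logits, targets)

# Hypothetical batch: the same instruction in two languages describing the
# same path, and views with matched objects from two different environments.
en_text   = torch.randn(32, 512)  # English instruction embeddings
hi_text   = torch.randn(32, 512)  # Hindi instruction embeddings (same paths)
img_env_a = torch.randn(32, 512)  # view embeddings from environment A
img_env_b = torch.randn(32, 512)  # semantically-matched views from environment B

cross_lingual_loss = alignment_loss(en_text, hi_text)
env_agnostic_loss = alignment_loss(img_env_a, img_env_b)
total_loss = cross_lingual_loss + env_agnostic_loss
```

In this sketch, each in-batch pair other than the aligned one acts as a negative, so minimizing the loss increases the similarity of aligned pairs relative to unaligned ones, which is one common way to realize the "maximize similarity between aligned pairs" idea the abstract describes.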
