Anticipating the Unseen Discrepancy for Vision and Language Navigation

Paper title

Anticipating the Unseen Discrepancy for Vision and Language Navigation

Authors

Yujie Lu, Huiliang Zhang, Ping Nie, Weixi Feng, Wenda Xu, Xin Eric Wang, William Yang Wang

Abstract

Vision-Language Navigation (VLN) requires an agent to follow natural language instructions to reach a specific target. The large discrepancy between seen and unseen environments makes it challenging for the agent to generalize well. Previous studies propose data augmentation methods that mitigate data bias, explicitly or implicitly, and improve generalization. However, they try to memorize augmented trajectories and ignore the distribution shifts under unseen environments at test time. In this paper, we propose Unseen Discrepancy Anticipating Vision and Language Navigation (DAVIS), which learns to generalize to unseen environments by encouraging test-time visual consistency. Specifically, we devise: 1) a semi-supervised framework, DAVIS, that leverages visual consistency signals across similar semantic observations; and 2) a two-stage learning procedure that encourages adaptation to the test-time distribution. The framework enhances the basic mixture of imitation and reinforcement learning with Momentum Contrast to encourage stable decision-making on similar observations under a joint training stage and a test-time adaptation stage. Extensive experiments show that DAVIS achieves model-agnostic improvement over previous state-of-the-art VLN baselines on the R2R and RxR benchmarks. Our source code and data are in supplemental materials.
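
The abstract names Momentum Contrast as the consistency mechanism but does not give the loss. As a rough illustration only, the sketch below shows the generic Momentum Contrast ingredients: an exponential-moving-average update of a key encoder and an InfoNCE loss over one positive key and a queue of negatives. It is not the authors' implementation. The function names, the NumPy-only setup, the momentum value and the temperature are all assumptions made for this example, and how DAVIS combines such a term with imitation and reinforcement learning is not specified here.

```python
import numpy as np

def momentum_update(key_params, query_params, m=0.999):
    """Move each key-encoder parameter toward the query encoder's (EMA update).

    Hypothetical helper: parameters are passed as plain lists of arrays.
    """
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def info_nce_loss(queries, positive_keys, negative_queue, temperature=0.07):
    """InfoNCE loss with one positive per query and a shared queue of negatives.

    queries:        (B, D) features from the query encoder
    positive_keys:  (B, D) features of semantically similar observations
    negative_queue: (K, D) features of other observations
    """
    q = l2_normalize(queries)
    k_pos = l2_normalize(positive_keys)
    k_neg = l2_normalize(negative_queue)

    pos_logits = np.sum(q * k_pos, axis=1, keepdims=True)   # (B, 1)
    neg_logits = q @ k_neg.T                                 # (B, K)
    logits = np.concatenate([pos_logits, neg_logits], axis=1) / temperature

    # Numerically stable log-softmax; the positive sits at column 0.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob_pos.mean())
```

In this sketch, a query whose feature matches its positive key gets a loss close to zero, while a query that is closer to a queued negative is penalized, which is the sense in which such a term pushes an agent toward consistent representations of similar observations.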

Paper title