Paper Title

Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training

Paper Authors

Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, Jianfeng Gao

Paper Abstract

Learning to navigate in a visual environment following natural-language instructions is a challenging task, because the multimodal inputs to the agent are highly variable, and the training data on a new task is often limited. In this paper, we present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks. By training on a large amount of image-text-action triplets in a self-supervised learning manner, the pre-trained model provides generic representations of visual environments and language instructions. It can be easily used as a drop-in for existing VLN frameworks, leading to the proposed agent called Prevalent. It learns more effectively in new tasks and generalizes better in a previously unseen environment. The performance is validated on three VLN tasks. On the Room-to-Room benchmark, our model improves the state-of-the-art from 47% to 51% on success rate weighted by path length. Further, the learned representation is transferable to other VLN tasks. On two recent tasks, vision-and-dialog navigation and "Help, Anna!", the proposed Prevalent leads to significant improvement over existing methods, achieving a new state of the art.
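The core pre-training idea above can be illustrated with a toy sketch: given image-text-action triplets, learn a representation by predicting the action from the joint image and text features. This is a minimal, hypothetical NumPy illustration on synthetic data (a linear model trained with cross-entropy), not the authors' implementation; all dimensions, names, and the data-generation scheme are assumptions for illustration only.

```python
# Toy illustration (assumption: not the paper's code) of learning from
# image-text-action triplets: predict the action taken at each step from
# the concatenated image and instruction features.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic triplets: image features, instruction features, and a
# discrete action label (e.g., turn left/right, move forward, stop).
N, D_IMG, D_TXT, N_ACTIONS = 256, 16, 16, 4
img = rng.normal(size=(N, D_IMG))
txt = rng.normal(size=(N, D_TXT))
X = np.concatenate([img, txt], axis=1)

# Ground-truth actions come from an (unknown to the learner) linear rule,
# so a linear action-prediction head can in principle recover them.
true_w = rng.normal(size=(D_IMG + D_TXT, N_ACTIONS))
actions = np.argmax(X @ true_w, axis=1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# "Pre-training": full-batch gradient descent on the cross-entropy of
# the action-prediction objective over the triplets.
W = np.zeros((D_IMG + D_TXT, N_ACTIONS))
for _ in range(1000):
    grad_logits = softmax(X @ W)
    grad_logits[np.arange(N), actions] -= 1.0  # dL/dlogits for CE loss
    W -= 0.5 * (X.T @ grad_logits) / N

acc = float((np.argmax(X @ W, axis=1) == actions).mean())
print(f"action-prediction accuracy on the triplets: {acc:.2f}")
```

In the paper's actual setting the encoder is a transformer trained with self-supervised objectives over large-scale triplets, and the learned representation is then dropped into existing VLN agents; the sketch only conveys the shape of the data and the action-prediction signal.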
