Paper Title
Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation
Paper Authors
Paper Abstract
One of the most challenging topics in Natural Language Processing (NLP) is visually grounded language understanding and reasoning. Outdoor vision-and-language navigation (VLN) is one such task, in which an agent follows natural language instructions to navigate a real-life urban environment. Due to the lack of human-annotated instructions that describe intricate urban scenes, outdoor VLN remains a challenging task to solve. This paper introduces a Multimodal Text Style Transfer (MTST) learning approach that leverages external multimodal resources to mitigate data scarcity in outdoor navigation tasks. We first enrich the navigation data by transferring the style of the instructions generated by the Google Maps API, then pre-train the navigator on the augmented external outdoor navigation dataset. Experimental results show that our MTST learning approach is model-agnostic and significantly outperforms the baseline models on the outdoor VLN task, yielding an 8.7% relative improvement in task completion rate on the test set.
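To make the two-stage recipe in the abstract concrete, here is a minimal, illustrative Python sketch of the MTST data-augmentation and pre-training flow. Everything in it is a hypothetical placeholder invented for illustration (`Route`, `augment_with_style_transfer`, the toy `rewrite` function), not the authors' released code; in the paper, the style-transfer component is a learned multimodal model, not a string substitution.

```python
# Illustrative sketch of the MTST idea, assuming hypothetical helper types.
# Not the authors' implementation: the real style-transfer model is learned
# and conditions on the visual panoramas as well as the text.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Route:
    panoramas: List[str]   # panorama IDs along the route (placeholder)
    instruction: str       # machine-generated text, e.g. from the Google Maps API


def augment_with_style_transfer(
    routes: List[Route],
    rewrite: Callable[[str], str],
) -> List[Tuple[List[str], str]]:
    """Rewrite template-like instructions into the style of
    human-annotated VLN instructions, keeping the visual trajectory."""
    return [(r.panoramas, rewrite(r.instruction)) for r in routes]


if __name__ == "__main__":
    # Toy stand-in for the learned multimodal style-transfer model.
    toy_rewrite = lambda s: s.replace("Head", "Walk").replace("toward", "until you reach")

    external = [Route(["pano_1", "pano_2"], "Head north toward the intersection.")]
    pretrain_data = augment_with_style_transfer(external, toy_rewrite)
    print(pretrain_data)
    # Stage 1: pre-train the navigator on `pretrain_data`.
    # Stage 2: fine-tune it on the human-annotated outdoor VLN training set.
```

The design point this sketch captures is that the augmentation changes only the instruction text, never the underlying panorama sequence, so the pre-training signal stays grounded in real street-view trajectories.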