Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule

Paper Title

Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule

Authors

Shuhei Kurita, Kyunghyun Cho

Abstract

Vision-and-language navigation (VLN) is a task in which an agent is embodied in a realistic 3D environment and follows an instruction to reach the goal node. While most previous studies have built and investigated a discriminative approach, we notice that there are in fact two possible approaches to building such a VLN agent: discriminative and generative. In this paper, we design and investigate a generative language-grounded policy, which uses a language model to compute the distribution over all possible instructions, i.e., all possible sequences of vocabulary tokens, given the action and the transition history. In experiments, we show that the proposed generative approach outperforms the discriminative approach on the Room-2-Room (R2R) and Room-4-Room (R4R) datasets, especially in unseen environments. We further show that the combination of the generative and discriminative policies achieves close to state-of-the-art results on the R2R dataset, demonstrating that the generative and discriminative policies capture different aspects of VLN.
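
The generative policy in the abstract can be read as an application of Bayes' rule: the probability of an action given the instruction is proportional to the language model's probability of the instruction given that action and the history, times a prior over actions. The following is a minimal Python sketch of that idea, not the authors' implementation. The function name `generative_policy`, the toy log-likelihood values, and the default uniform prior are all assumptions made for illustration.

```python
import math


def generative_policy(instr_log_likelihoods, log_prior=None):
    """Return a posterior over actions using Bayes' rule.

    instr_log_likelihoods maps each candidate action a to
    log p(instruction | a, history), as a language model might score it.
    log_prior maps each action to log p(a | history); if omitted, a
    uniform prior is assumed (an illustrative choice).
    """
    actions = list(instr_log_likelihoods)
    if log_prior is None:
        log_prior = {a: -math.log(len(actions)) for a in actions}

    # Unnormalized log posterior: log p(X | a, h) + log p(a | h).
    joint = {a: instr_log_likelihoods[a] + log_prior[a] for a in actions}

    # Normalize with a numerically stable log-sum-exp.
    m = max(joint.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in joint.values()))
    return {a: math.exp(v - log_z) for a, v in joint.items()}


# Hypothetical instruction scores for three candidate actions.
scores = {"forward": -12.0, "left": -15.0, "stop": -20.0}
posterior = generative_policy(scores)
best_action = max(posterior, key=posterior.get)
```

With a uniform prior, the action whose hypothetical trajectory makes the instruction most likely receives the highest posterior probability. A discriminative policy would instead model the probability of the action given the instruction directly.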
