Paper Title
Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding
Paper Authors
Paper Abstract
We introduce Room-Across-Room (RxR), a new Vision-and-Language Navigation (VLN) dataset. RxR is multilingual (English, Hindi, and Telugu) and larger (more paths and instructions) than other VLN datasets. It emphasizes the role of language in VLN by addressing known biases in paths and eliciting more references to visible entities. Furthermore, each word in an instruction is time-aligned to the virtual poses of instruction creators and validators. We establish baseline scores for monolingual and multilingual settings and multitask learning when including Room-to-Room annotations. We also provide results for a model that learns from synchronized pose traces by focusing only on portions of the panorama attended to in human demonstrations. The size, scope and detail of RxR dramatically expands the frontier for research on embodied language agents in simulated, photo-realistic environments.
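To make the word-level spatiotemporal grounding described in the abstract more concrete, here is a minimal, hypothetical Python sketch of how a time-aligned pose trace might be represented. All class and field names (Pose, AlignedWord, PoseTrace, pano_id, and so on) are illustrative assumptions for this sketch, not the actual RxR release format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pose:
    """Hypothetical annotator pose in the simulator: a panorama plus camera angles."""
    pano_id: str       # panoramic viewpoint the annotator occupied
    heading: float     # horizontal camera angle, in radians
    elevation: float   # vertical camera angle, in radians


@dataclass
class AlignedWord:
    """One instruction word, time-aligned to the pose held while it was spoken."""
    word: str
    start_time: float  # seconds from the start of the demonstration
    end_time: float
    pose: Pose


@dataclass
class PoseTrace:
    """A full instruction paired with its word-level pose alignment."""
    instruction_id: str
    language: str            # e.g. "en", "hi", or "te" (English, Hindi, Telugu)
    words: List[AlignedWord]


def panos_for_word(trace: PoseTrace, target: str) -> List[str]:
    """Return the panoramas the annotator occupied while uttering a given word."""
    return [w.pose.pano_id for w in trace.words if w.word.lower() == target.lower()]
```

Under this illustrative representation, an agent (or analysis script) can ask where the annotator was standing when a particular entity was mentioned, which is the kind of query the synchronized pose traces are meant to support.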