Paper Title
Semantic Audio-Visual Navigation
Paper Authors
Paper Abstract
Recent work on audio-visual navigation assumes a constantly-sounding target and restricts the role of audio to signaling the target's position. We introduce semantic audio-visual navigation, where objects in the environment make sounds consistent with their semantic meaning (e.g., toilet flushing, door creaking) and acoustic events are sporadic or short in duration. We propose a transformer-based model to tackle this new semantic AudioGoal task, incorporating an inferred goal descriptor that captures both spatial and semantic properties of the target. Our model's persistent multimodal memory enables it to reach the goal even long after the acoustic event stops. In support of the new task, we also expand the SoundSpaces audio simulations to provide semantically grounded sounds for an array of objects in Matterport3D. Our method strongly outperforms existing audio-visual navigation methods by learning to associate semantic, acoustic, and visual cues.
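The abstract describes the model only at a high level. As a rough illustration of how the described pieces could fit together, below is a minimal PyTorch sketch; it is not the authors' implementation. The class name `SemanticAudioVisualPolicy`, all module choices, and all dimensions are assumptions. It shows the three ideas named in the abstract: a goal descriptor inferred from audio that combines a semantic part (predicted object category) with a spatial part (predicted relative location), a persistent memory of fused observations, and a transformer that attends over that memory to choose actions.

```python
import torch
import torch.nn as nn

class SemanticAudioVisualPolicy(nn.Module):
    """Hypothetical sketch of a transformer policy with an inferred goal
    descriptor and a persistent multimodal memory. All sizes are illustrative."""

    def __init__(self, d_model=128, n_actions=4, n_categories=21, memory_size=150):
        super().__init__()
        # Toy encoders for an RGB frame and a binaural spectrogram (assumed shapes).
        self.visual_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, d_model))
        self.audio_enc = nn.Sequential(nn.Flatten(), nn.Linear(2 * 65 * 26, d_model))
        # Goal descriptor: semantic part (category) + spatial part (relative x, y).
        self.category_head = nn.Linear(d_model, n_categories)
        self.location_head = nn.Linear(d_model, 2)
        self.goal_proj = nn.Linear(n_categories + 2, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.policy = nn.Linear(d_model, n_actions)
        self.memory_size = memory_size

    def forward(self, rgb, spectrogram, memory):
        v = self.visual_enc(rgb)
        a = self.audio_enc(spectrogram)
        # Infer the goal descriptor from the current audio observation.
        category = self.category_head(a).softmax(dim=-1)
        location = self.location_head(a)
        g = self.goal_proj(torch.cat([category, location], dim=-1))
        # Append the fused observation to the persistent memory (sliding window).
        step = (v + a + g).unsqueeze(1)
        memory = torch.cat([memory, step], dim=1)[:, -self.memory_size:]
        # Attending over all stored steps lets the agent keep navigating toward
        # the goal even after the sporadic acoustic event has stopped.
        h = self.transformer(memory)
        return self.policy(h[:, -1]), memory

# Usage: one step of an episode, starting from an empty memory.
policy = SemanticAudioVisualPolicy()
memory = torch.zeros(1, 0, 128)            # no observations stored yet
rgb = torch.rand(1, 3, 64, 64)             # assumed egocentric RGB frame
spec = torch.rand(1, 2, 65, 26)            # assumed binaural spectrogram
action_logits, memory = policy(rgb, spec, memory)
```

The key design choice this sketch tries to mirror is that the memory is external to the per-step computation and is carried across steps, so silence (a zero or stale spectrogram) does not erase the spatial and semantic evidence accumulated while the object was sounding.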