Paper Title


FlowGrad: Using Motion for Visual Sound Source Localization

Authors

Rajsuryan Singh, Pablo Zinemanas, Xavier Serra, Juan Pablo Bello, Magdalena Fuentes

Abstract


Most recent work in visual sound source localization relies on semantic audio-visual representations learned in a self-supervised manner, and by design excludes temporal information present in videos. While it proves to be effective for widely used benchmark datasets, the method falls short for challenging scenarios like urban traffic. This work introduces temporal context into the state-of-the-art methods for sound source localization in urban scenes using optical flow as a means to encode motion information. An analysis of the strengths and weaknesses of our methods helps us better understand the problem of visual sound source localization and sheds light on open challenges for audio-visual scene understanding.
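The abstract describes using optical flow to inject motion information into a sound source localization pipeline. As an illustrative sketch only (not the paper's actual FlowGrad method), one simple way to combine the two cues is to blend an audio-visual similarity map with a normalized flow-magnitude map, so that moving regions receive higher localization scores. All function names and the `alpha` blending parameter below are hypothetical:

```python
import numpy as np

def flow_magnitude(flow):
    """flow: (H, W, 2) array of per-pixel (dx, dy) displacements."""
    return np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)

def motion_weighted_localization(av_map, flow, alpha=0.5):
    """Blend an audio-visual similarity map with flow magnitude.

    av_map: (H, W) similarity scores in [0, 1].
    alpha:  weight of the motion cue (hypothetical parameter).
    """
    mag = flow_magnitude(flow)
    mag = mag / (mag.max() + 1e-8)  # normalize to [0, 1]
    return (1 - alpha) * av_map + alpha * mag

# Toy example: a 2x2 frame where only the bottom-right pixel moves.
av = np.array([[0.2, 0.2], [0.2, 0.8]])
fl = np.zeros((2, 2, 2))
fl[1, 1] = [3.0, 4.0]  # displacement vector of length 5
out = motion_weighted_localization(av, fl, alpha=0.5)
# The moving pixel's score is boosted relative to static regions.
```

In this toy case the moving pixel ends up with the highest score (0.9), while static pixels keep only their halved similarity (0.1). The actual paper integrates motion into state-of-the-art self-supervised localization models; this sketch only conveys the intuition of motion as a weighting cue.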
