Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes

Paper Title

Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes

Authors

Zengjie Song, Yuxi Wang, Junsong Fan, Tieniu Tan, Zhaoxiang Zhang

Abstract

Sound source localization in visual scenes aims to localize the objects that emit sound in a given image. Recent works showing impressive localization performance typically rely on the contrastive learning framework. However, the random sampling of negatives, as commonly adopted in these methods, can result in misalignment between audio and visual features and thus induce ambiguity in localization. In this paper, instead of following previous literature, we propose Self-Supervised Predictive Learning (SSPL), a negative-free method for sound localization via explicit positive mining. Specifically, we first devise a three-stream network that associates the sound source with two augmented views of one corresponding video frame, leading to semantically coherent similarities between audio and visual features. Second, we introduce a novel predictive coding module for audio-visual feature alignment. This module helps SSPL focus on target objects in a progressive manner and effectively lowers the positive-pair learning difficulty. Experiments show that SSPL outperforms the state-of-the-art approach on two standard sound localization benchmarks. In particular, SSPL achieves significant improvements of 8.6% cIoU and 3.4% AUC on SoundNet-Flickr compared to the previous best. Code is available at: https://github.com/zjsong/SSPL.
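To make the "negative-free" idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each of the three streams (audio, and two augmented views of the same frame) has already been reduced to one feature vector per sample, and it scores only positive pairs with cosine similarity. The function names `cosine_similarity` and `negative_free_loss` are hypothetical, and the predictive coding module described in the abstract is not modeled here.

```python
import numpy as np


def cosine_similarity(x, y, eps=1e-8):
    """Row-wise cosine similarity between two (batch, dim) arrays."""
    x = x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)
    y = y / (np.linalg.norm(y, axis=-1, keepdims=True) + eps)
    return np.sum(x * y, axis=-1)


def negative_free_loss(audio, view1, view2):
    """Average negative cosine similarity over the three positive pairs.

    Assumption for illustration: all inputs have shape (batch, dim) and row i
    of each array comes from the same video clip. No negative pairs are
    sampled, so the loss only pulls matching features together.
    """
    pairs = [(audio, view1), (audio, view2), (view1, view2)]
    return float(np.mean([-cosine_similarity(a, b).mean() for a, b in pairs]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(4, 16))
    # Small perturbations stand in for features of two augmented views.
    view1 = audio + 0.1 * rng.normal(size=(4, 16))
    view2 = audio + 0.1 * rng.normal(size=(4, 16))
    print(negative_free_loss(audio, view1, view2))
```

The loss approaches -1 as the three feature sets align. In practice, methods of this kind usually need an extra mechanism (for example a stop-gradient or a predictor head) to avoid collapsed solutions; that part is omitted from this sketch.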
