Paper Title
Audio-Visual Segmentation
Authors
Abstract
We propose to explore a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos. Two settings are studied with this benchmark: 1) semi-supervised audio-visual segmentation with a single sound source and 2) fully-supervised audio-visual segmentation with multiple sound sources. To deal with the AVS problem, we propose a novel method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage the audio-visual mapping during training. Quantitative and qualitative experiments on the AVSBench compare our approach to several existing methods from related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics. Code is available at https://github.com/OpenNLPLab/AVSBench.
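As a rough illustration of what "injecting audio semantics" into per-pixel visual features can look like, the sketch below lets every pixel attend over the audio features of all frames and adds the result back residually. It is not the paper's interaction module: the function name `inject_audio`, the tensor shapes, and the scaled dot-product attention are all assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def inject_audio(visual, audio):
    """Hypothetical pixel-wise audio-visual fusion (illustrative only).

    visual: (T, H, W, C) per-frame visual feature maps
    audio:  (T, C) one audio feature vector per frame
    Returns the fused features (T, H, W, C) and the attention
    weights (T, H, W, T) of each pixel over the T audio vectors.
    """
    T, H, W, C = visual.shape
    assert audio.shape == (T, C)
    pixels = visual.reshape(T * H * W, C)
    # Similarity between every pixel and every frame's audio feature.
    scores = pixels @ audio.T / np.sqrt(C)      # (T*H*W, T)
    weights = softmax(scores, axis=-1)
    # Audio summary for each pixel, added as a residual.
    attended = weights @ audio                  # (T*H*W, C)
    fused = pixels + attended
    return fused.reshape(T, H, W, C), weights.reshape(T, H, W, T)

rng = np.random.default_rng(0)
visual = rng.normal(size=(5, 4, 4, 8))   # 5 frames, 4x4 feature map, 8 channels
audio = rng.normal(size=(5, 8))
fused, weights = inject_audio(visual, audio)
print(fused.shape, weights.shape)
```

In a real model the fused features would be passed to a segmentation decoder that predicts one binary mask per frame; that part is omitted here.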