Paper Title
Leveraging Category Information for Single-Frame Visual Sound Source Separation
Paper Authors
Paper Abstract
Visual sound source separation aims at identifying sound components from a given sound mixture with the presence of visual cues. Prior works have demonstrated impressive results, but at the expense of large multi-stage architectures and complex data representations (e.g., optical flow trajectories). In contrast, we study simple yet efficient models for visual sound separation using only a single video frame. Furthermore, our models are able to exploit the information of the sound source category in the separation process. To this end, we propose two models where we assume that i) the category labels are available at training time, or ii) we know whether the training sample pairs come from the same or different categories. Experiments on the MUSIC dataset show that our models obtain performance comparable to or better than several recent baseline methods. The code is available at https://github.com/ly-zhu/Leveraging-Category-Information-for-Single-Frame-Visual-Sound-Source-Separation
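To make the single-frame setup concrete, below is a minimal sketch (not the authors' released implementation) of an appearance-conditioned mask-prediction pipeline in the Sound-of-Pixels style: a ResNet-18 encoder maps one video frame to a conditioning vector that modulates a small spectrogram U-Net, which predicts a separation mask for the mixture. All layer sizes, the fusion scheme, and the class names (FrameEncoder, MaskUNet) are illustrative assumptions.

```python
# Hedged sketch: single-frame visual sound separation via an
# appearance-conditioned spectrogram U-Net. Sizes and fusion are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models


class FrameEncoder(nn.Module):
    """Encodes a single video frame into a conditioning vector."""
    def __init__(self, dim=32):
        super().__init__()
        resnet = models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop fc
        self.proj = nn.Linear(512, dim)

    def forward(self, frame):                      # frame: (B, 3, H, W)
        feat = self.backbone(frame).flatten(1)     # (B, 512)
        return self.proj(feat)                     # (B, dim)


class MaskUNet(nn.Module):
    """Tiny U-Net over the mixture spectrogram, fused with the frame feature."""
    def __init__(self, dim=32):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(1, 16, 4, 2, 1), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv2d(16, dim, 4, 2, 1), nn.ReLU())
        self.up1 = nn.Sequential(nn.ConvTranspose2d(dim, 16, 4, 2, 1), nn.ReLU())
        self.up2 = nn.ConvTranspose2d(32, 1, 4, 2, 1)  # 16 + 16 skip channels

    def forward(self, spec, v):                    # spec: (B, 1, F, T)
        h1 = self.down1(spec)                      # (B, 16, F/2, T/2)
        h2 = self.down2(h1)                        # (B, dim, F/4, T/4)
        # Fuse: channel-wise scaling of the bottleneck by the visual feature.
        h2 = h2 * v[:, :, None, None]
        u1 = self.up1(h2)                          # (B, 16, F/2, T/2)
        mask = torch.sigmoid(self.up2(torch.cat([u1, h1], dim=1)))
        return mask                                # (B, 1, F, T), in [0, 1]


if __name__ == "__main__":
    frames = torch.randn(2, 3, 224, 224)           # one frame per mixture
    mix_spec = torch.randn(2, 1, 256, 64).abs()    # magnitude spectrograms
    v = FrameEncoder()(frames)
    mask = MaskUNet()(mix_spec, v)
    separated = mask * mix_spec                    # masked source estimate
    print(separated.shape)                         # torch.Size([2, 1, 256, 64])
```

Category information could then be injected into such a pipeline, for instance via an auxiliary classification loss on the separated spectrogram when labels are available at training time; the exact mechanisms of the paper's two models are not specified in the abstract.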