Paper Title

Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation

Paper Authors

Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

Paper Abstract

Recent general-purpose audio representations show state-of-the-art performance on various audio tasks. These representations are pre-trained by self-supervised learning methods that create training signals from the input. For example, typical audio contrastive learning uses temporal relationships among input sounds to create training signals, whereas some methods use differences among input views created by data augmentations. However, these training signals do not provide information derived from the intact input sound, which we think is suboptimal for learning a representation that describes the input as it is. In this paper, we seek to learn audio representations from the input itself as supervision, using a pretext task of auto-encoding masked spectrogram patches: Masked Spectrogram Modeling (MSM, a variant of Masked Image Modeling applied to audio spectrograms). To implement MSM, we use Masked Autoencoders (MAE), an image self-supervised learning method. MAE learns to efficiently encode the small number of visible patches into latent representations that carry essential information for reconstructing the large number of masked patches. During training, MAE minimizes the reconstruction error, which uses the input as the training signal, consequently achieving our goal. We conducted experiments on our MSM using MAE (MSM-MAE) models under the evaluation benchmark of the HEAR 2021 NeurIPS Challenge. Our MSM-MAE models outperformed the HEAR 2021 Challenge results on seven out of 15 tasks (e.g., accuracies of 73.4% on CREMA-D and 85.8% on LibriCount), while showing top-level performance on the other tasks, where specialized models perform better. We also investigate how the design choices of MSM-MAE impact performance, and conduct qualitative analysis of visualization outcomes to gain an understanding of the learned representations. We make our code available online.
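
To make the pretext task concrete, below is a minimal PyTorch sketch of the MSM idea in the MAE style: patchify a log-mel spectrogram, randomly mask most patches, encode only the visible ones, reconstruct the masked ones, and compute the loss on the masked patches only. All names and sizes here (`MSMMAESketch`, 16x16 patches, the tiny 2-layer Transformer stand-ins) are illustrative assumptions, not the authors' released implementation, which uses full ViT encoder/decoder blocks; see their code online for the actual architecture.

```python
# Minimal sketch of Masked Spectrogram Modeling with a Masked Autoencoder.
# Shapes and module sizes are assumptions for illustration only.
import torch
import torch.nn as nn

class MSMMAESketch(nn.Module):
    def __init__(self, n_patches=65, patch_size=16, embed_dim=192, mask_ratio=0.75):
        super().__init__()
        self.patch_size, self.mask_ratio = patch_size, mask_ratio
        # Project each flattened spectrogram patch to an embedding.
        self.patch_embed = nn.Linear(patch_size * patch_size, embed_dim)
        self.pos_embed = nn.Parameter(torch.randn(1, n_patches, embed_dim) * 0.02)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Tiny Transformer stand-ins for MAE's ViT encoder and decoder.
        layer = lambda: nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.decoder = nn.TransformerEncoder(layer(), num_layers=1)
        # Predict the raw values of each patch from decoder outputs.
        self.head = nn.Linear(embed_dim, patch_size * patch_size)

    def patchify(self, spec):
        # spec: (B, F, T) log-mel spectrogram, F and T divisible by patch_size.
        B, F, T = spec.shape
        p = self.patch_size
        x = spec.reshape(B, F // p, p, T // p, p).permute(0, 1, 3, 2, 4)
        return x.reshape(B, (F // p) * (T // p), p * p)  # (B, N, p*p)

    def forward(self, spec):
        patches = self.patchify(spec)
        B, N, _ = patches.shape
        n_keep = int(N * (1 - self.mask_ratio))  # e.g. keep 25%, mask 75%
        # Random shuffle per sample; the first n_keep indices stay visible.
        ids = torch.argsort(torch.rand(B, N, device=spec.device), dim=1)
        ids_keep, ids_mask = ids[:, :n_keep], ids[:, n_keep:]
        gather = lambda t, i: torch.gather(t, 1, i.unsqueeze(-1).expand(-1, -1, t.shape[-1]))
        pos = self.pos_embed.expand(B, -1, -1)
        # Encode only the visible patches -- this is what makes MAE efficient.
        z = self.encoder(self.patch_embed(gather(patches, ids_keep)) + gather(pos, ids_keep))
        # The decoder sees encoded visible tokens plus mask tokens at masked positions.
        mask_tokens = self.mask_token.expand(B, N - n_keep, -1) + gather(pos, ids_mask)
        pred = self.head(self.decoder(torch.cat([z, mask_tokens], dim=1)))
        # The reconstruction error is computed on the masked patches only,
        # so the input itself serves as the training signal.
        return ((pred[:, n_keep:] - gather(patches, ids_mask)) ** 2).mean()

# Toy usage: 80-mel x 208-frame spectrograms give 5 x 13 = 65 patches of 16x16.
model = MSMMAESketch(n_patches=65)
loss = model(torch.randn(4, 80, 208))
loss.backward()
```

The 75% mask ratio follows MAE's default; because the encoder processes only the visible 25% of patches, pre-training cost shrinks accordingly, while the decoder's job of filling in the rest forces the latent representation to capture the whole input.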
