Masked Autoencoders that Listen

Paper title

Masked Autoencoders that Listen

Authors

Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, Christoph Feichtenhofer

Abstract

This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training. The code and models will be at https://github.com/facebookresearch/AudioMAE.
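
The masking and re-ordering steps described in the abstract can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the function names `random_masking` and `restore_with_mask_tokens`, the 0.75 masking ratio, and the placeholder string tokens are assumptions made for illustration, and the encoder itself is omitted. The sketch only shows how a random subset of patch tokens is kept for the encoder and how the sequence is later padded with mask tokens and put back in its original order for the decoder.

```python
import random


def random_masking(tokens, mask_ratio, rng):
    """Keep a random subset of patch tokens (hypothetical helper).

    Returns the visible tokens (in shuffled order), a per-position mask
    (True means the patch was masked out), and the indices needed to undo
    the shuffle later.
    """
    num_tokens = len(tokens)
    num_keep = int(num_tokens * (1 - mask_ratio))

    # Random permutation of patch positions.
    ids_shuffle = list(range(num_tokens))
    rng.shuffle(ids_shuffle)

    # Inverse permutation: ids_restore[i] is where patch i sits after shuffling.
    ids_restore = [0] * num_tokens
    for pos, idx in enumerate(ids_shuffle):
        ids_restore[idx] = pos

    ids_keep = ids_shuffle[:num_keep]
    visible = [tokens[i] for i in ids_keep]
    kept = set(ids_keep)
    mask = [i not in kept for i in range(num_tokens)]
    return visible, mask, ids_restore


def restore_with_mask_tokens(encoded, mask_token, ids_restore):
    """Pad encoded tokens with mask tokens and restore the original order."""
    num_masked = len(ids_restore) - len(encoded)
    padded = list(encoded) + [mask_token] * num_masked
    return [padded[pos] for pos in ids_restore]


if __name__ == "__main__":
    rng = random.Random(0)
    patches = [f"patch_{i}" for i in range(16)]  # stand-ins for spectrogram patches
    visible, mask, ids_restore = random_masking(patches, 0.75, rng)
    # An encoder would process only `visible` here; this sketch passes it through.
    decoder_input = restore_with_mask_tokens(visible, "[MASK]", ids_restore)
    print(decoder_input)
```

Because the encoder sees only the visible tokens, its cost scales with the kept fraction of the sequence rather than with its full length, which is what makes a high masking ratio attractive during pre-training.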
