Paper Title

Attention-Based Audio Embeddings for Query-by-Example

Paper Authors

Anup Singh, Kris Demuynck, Vipul Arora

Paper Abstract

An ideal audio retrieval system efficiently and robustly recognizes a short query snippet from an extensive database. However, the performance of well-known audio fingerprinting systems falls short at high signal distortion levels. This paper presents an audio retrieval system that generates noise and reverberation robust audio fingerprints using the contrastive learning framework. Using these fingerprints, the method performs a comprehensive search to identify the query audio and precisely estimate its timestamp in the reference audio. Our framework involves training a CNN to maximize the similarity between pairs of embeddings extracted from clean audio and its corresponding distorted and time-shifted version. We employ a channel-wise spectral-temporal attention mechanism to better discriminate the audio by giving more weight to the salient spectral-temporal patches in the signal. Experimental results indicate that our system is efficient in computation and memory usage while being more accurate, particularly at higher distortion levels, than competing state-of-the-art systems and scalable to a larger database.
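The abstract describes training a CNN encoder with a contrastive objective so that embeddings (fingerprints) of a clean audio patch and of its distorted, time-shifted counterpart are pulled together, with a channel-wise spectro-temporal attention module emphasizing salient regions. The sketch below is not the authors' code: the network sizes, attention design, temperature, and loss form (NT-Xent-style) are illustrative assumptions that show how such a training setup is commonly wired up.

```python
# Minimal sketch (assumed implementation, not the paper's code) of a fingerprint
# encoder with channel-wise attention and a contrastive loss over clean/distorted pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """Re-weights CNN channels so salient spectro-temporal content dominates (assumed design)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                       # x: (B, C, F, T)
        w = self.mlp(x.mean(dim=(2, 3)))        # global descriptor -> channel attention logits
        return x * torch.sigmoid(w)[:, :, None, None]


class FingerprintEncoder(nn.Module):
    """Toy CNN that maps a log-mel spectrogram patch to a unit-norm fingerprint embedding."""
    def __init__(self, emb_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            ChannelAttention(32),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            ChannelAttention(64),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, emb_dim)

    def forward(self, spec):                    # spec: (B, 1, F, T)
        z = self.proj(self.features(spec).flatten(1))
        return F.normalize(z, dim=-1)           # unit norm, so dot product = cosine similarity


def contrastive_loss(z_clean, z_distorted, temperature: float = 0.1):
    """Each clean patch must match its own distorted version against all other
    patches in the batch (NT-Xent-style; temperature is an assumed value)."""
    logits = z_clean @ z_distorted.t() / temperature
    targets = torch.arange(z_clean.size(0), device=z_clean.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    enc = FingerprintEncoder()
    clean = torch.randn(8, 1, 64, 32)                       # batch of clean log-mel patches
    distorted = clean + 0.3 * torch.randn_like(clean)       # stand-in for noise/reverb/shift
    print(float(contrastive_loss(enc(clean), enc(distorted))))
```

At retrieval time, the same encoder would fingerprint overlapping query patches, and a nearest-neighbor search over the reference fingerprint database (plus alignment of matched patch indices) yields the identified track and its timestamp, per the search procedure the abstract outlines.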
