Title

Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text

Authors

Alexander Schindler, Sergiu Gordea, Peter Knees

Abstract


We present an approach to unsupervised audio representation learning. Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track-relatedness. By applying Latent Semantic Indexing (LSI), we embed corresponding textual information into a latent vector space from which we derive track relatedness for online triplet selection. This LSI topic modelling facilitates fine-grained selection of similar and dissimilar audio-track pairs to learn the audio representation using a Convolutional Recurrent Neural Network (CRNN). In this way, we directly project the semantic context of the unstructured text modality onto the learned representation space of the audio modality without deriving structured ground-truth annotations from it. We evaluate our approach on the Europeana Sounds collection and show how to improve search in digital audio libraries by harnessing the multilingual metadata provided by numerous European digital libraries. We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection. The learned representations perform comparably to the baseline of handcrafted features, and exceed this baseline in similarity-retrieval precision at higher cut-offs, with only 15% of the baseline's feature-vector length.
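The core idea of the abstract (LSI embeddings of track metadata drive online triplet selection for training an audio network) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy metadata, the latent dimensionality, the margin value, and the `triplet_loss` helper are all assumptions made for the example.

```python
# Hypothetical sketch: LSI over track metadata, then relatedness-based
# triplet selection. Data and parameters are illustrative only.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy multilingual metadata for four audio tracks (stand-in data).
metadata = [
    "folk song field recording Austria",
    "chanson folklorique enregistrement de terrain",
    "radio news broadcast interview",
    "Nachrichten Rundfunk Interview Aufnahme",
]

# LSI: TF-IDF followed by truncated SVD into a latent topic space.
tfidf = TfidfVectorizer().fit_transform(metadata)
lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Track relatedness = cosine similarity in the LSI space.
sim = cosine_similarity(lsi)

# Online triplet selection: for an anchor track, take the most related
# other track as positive and the least related as negative.
anchor = 0
others = [i for i in range(len(metadata)) if i != anchor]
positive = max(others, key=lambda i: sim[anchor, i])
negative = min(others, key=lambda i: sim[anchor, i])

def triplet_loss(a, p, n, margin=0.2):
    """Standard triplet margin loss on (hypothetical) audio embeddings."""
    return max(0.0, np.linalg.norm(a - p) - np.linalg.norm(a - n) + margin)
```

In the paper's setting, the embeddings fed to the loss would come from the CRNN applied to audio spectrograms; the text modality is used only to decide which pairs count as similar or dissimilar, so no structured labels are ever created.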
