Paper Title

A comparison of self-supervised speech representations as input features for unsupervised acoustic word embeddings

Paper Authors

Lisa van Staden, Herman Kamper

Paper Abstract

Many speech processing tasks involve measuring the acoustic similarity between speech segments. Acoustic word embeddings (AWE) allow for efficient comparisons by mapping speech segments of arbitrary duration to fixed-dimensional vectors. For zero-resource speech processing, where unlabelled speech is the only available resource, some of the best AWE approaches rely on weak top-down constraints in the form of automatically discovered word-like segments. Rather than learning embeddings at the segment level, another line of zero-resource research has looked at representation learning at the short-time frame level. Recent approaches include self-supervised predictive coding and correspondence autoencoder (CAE) models. In this paper we consider whether these frame-level features are beneficial when used as inputs for training an unsupervised AWE model. We compare frame-level features from contrastive predictive coding (CPC), autoregressive predictive coding and a CAE to conventional MFCCs. These are used as inputs to a recurrent CAE-based AWE model. In a word discrimination task on English and Xitsonga data, all three representation learning approaches outperform MFCCs, with CPC consistently showing the biggest improvement. In cross-lingual experiments we find that CPC features trained on English can also be transferred to Xitsonga.
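To make the described pipeline concrete, below is a minimal sketch in PyTorch (not the authors' released code) of a correspondence-autoencoder AWE model: a recurrent encoder maps a variable-length sequence of frame-level features (e.g. MFCCs or CPC features) to a fixed-dimensional embedding, and a decoder is trained to reconstruct a different instance of the same automatically discovered word-like segment. All class names, dimensions, and variable names here are hypothetical.

```python
# A minimal sketch of a recurrent correspondence-autoencoder AWE model,
# assuming 13-dimensional input features and hypothetical dimensions.
import torch
import torch.nn as nn

class CorrespondenceAWE(nn.Module):
    def __init__(self, feat_dim=13, hidden_dim=256, embed_dim=128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.to_embed = nn.Linear(hidden_dim, embed_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.to_feat = nn.Linear(hidden_dim, feat_dim)

    def embed(self, x):
        # x: (batch, time, feat_dim). The encoder's final hidden state
        # summarises the whole variable-length segment as one
        # fixed-dimensional vector: the acoustic word embedding.
        _, h = self.encoder(x)
        return self.to_embed(h[-1])

    def forward(self, x, target_len):
        z = self.embed(x)
        # Condition the decoder on the embedding at every output step,
        # then project hidden states back to the feature space.
        z_rep = z.unsqueeze(1).expand(-1, target_len, -1)
        out, _ = self.decoder(z_rep)
        return self.to_feat(out)

# Hypothetical usage: x_a and x_b are two discovered instances of the
# same word-like segment; the model reconstructs x_b from x_a's embedding.
model = CorrespondenceAWE()
x_a = torch.randn(8, 60, 13)   # input instance (8 segments, 60 frames)
x_b = torch.randn(8, 55, 13)   # target instance of the same word
recon = model(x_a, target_len=x_b.size(1))
loss = nn.functional.mse_loss(recon, x_b)
```

Because the encoder's final state has a fixed size regardless of segment length, embeddings of segments with different durations can be compared directly, e.g. with cosine distance in a word discrimination task.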
