Paper Title

A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning

Authors

Sameer Khurana, Antoine Laurent, Wei-Ning Hsu, Jan Chorowski, Adrian Lancucki, Ricard Marxer, James Glass

Abstract

Probabilistic Latent Variable Models (LVMs) provide an alternative to self-supervised learning approaches for linguistic representation learning from speech. LVMs admit an intuitive probabilistic interpretation where the latent structure shapes the information extracted from the signal. Even though LVMs have recently seen a renewed interest due to the introduction of Variational Autoencoders (VAEs), their use for speech representation learning remains largely unexplored. In this work, we propose Convolutional Deep Markov Model (ConvDMM), a Gaussian state-space model with non-linear emission and transition functions modelled by deep neural networks. This unsupervised model is trained using black box variational inference. A deep convolutional neural network is used as an inference network for structured variational approximation. When trained on a large scale speech dataset (LibriSpeech), ConvDMM produces features that significantly outperform multiple self-supervised feature extracting methods on linear phone classification and recognition on the Wall Street Journal dataset. Furthermore, we found that ConvDMM complements self-supervised methods like Wav2Vec and PASE, improving on the results achieved with any of the methods alone. Lastly, we find that ConvDMM features enable learning better phone recognizers than any other features in an extreme low-resource regime with few labeled training examples.
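To make the model description concrete, the generative process of a deep Markov model like ConvDMM can be sketched as below: a Gaussian state-space model whose transition mean and emission mean are parameterized by neural networks. This is a minimal illustrative sketch, not the authors' implementation; the layer sizes, noise scales, and the use of simple MLPs with random weights (standing in for trained networks) are all assumptions, and the convolutional encoder and variational inference procedure are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Randomly initialised MLP weights (hypothetical stand-in for trained networks)."""
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(n)) for m, n in zip(sizes, sizes[1:])]

def forward(params, x):
    """Apply the MLP: tanh on hidden layers, linear output."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x

latent_dim, obs_dim = 16, 80  # e.g. 80-dim filterbank frames (assumed dimensions)

# Non-linear transition p(z_t | z_{t-1}) and emission p(x_t | z_t),
# both Gaussian with neural-network means, as in a deep Markov model.
trans_net = mlp([latent_dim, 32, latent_dim])
emit_net = mlp([latent_dim, 64, obs_dim])

def sample_sequence(num_frames):
    """Ancestral sampling from the generative model."""
    z = np.zeros(latent_dim)  # initial latent state z_0
    frames = []
    for _ in range(num_frames):
        # Gaussian transition: z_t ~ N(f(z_{t-1}), sigma^2 I)
        z = forward(trans_net, z) + 0.1 * rng.normal(size=latent_dim)
        # Gaussian emission: x_t ~ N(g(z_t), I)
        frames.append(forward(emit_net, z) + rng.normal(size=obs_dim))
    return np.stack(frames)

frames = sample_sequence(50)
print(frames.shape)  # (50, 80): one observation vector per time step
```

In the actual model, these networks are trained with black-box variational inference, with a deep convolutional network producing the structured variational approximation over the latent sequence; after training, the latents serve as speech features.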
