Paper Title

Self-supervised speech unit discovery from articulatory and acoustic features using VQ-VAE

Paper Authors

Marc-Antoine Georges, Jean-Luc Schwartz, Thomas Hueber

Paper Abstract

The human perception system is often assumed to recruit motor knowledge when processing auditory speech inputs. Using articulatory modeling and deep learning, this study examines how this articulatory information can be used for discovering speech units in a self-supervised setting. We used vector-quantized variational autoencoders (VQ-VAE) to learn discrete representations from articulatory and acoustic speech data. In line with the zero-resource paradigm, an ABX test was then used to investigate how the extracted representations encode phonetically relevant properties. Experiments were conducted on three different corpora in English and French. We found that articulatory information rather organises the latent representations in terms of place of articulation whereas the speech acoustics mainly structure the latent space in terms of manner of articulation. We show that an optimal fusion of the two modalities can lead to a joint representation of these phonetic dimensions more accurate than each modality considered individually. Since articulatory information is usually not available in a practical situation, we finally investigate the benefit it provides when inferred from the speech acoustics in a self-supervised manner.
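For readers unfamiliar with the quantization step the abstract refers to, the sketch below shows a minimal vector-quantization layer of the kind used in a VQ-VAE, written in PyTorch. It is an illustrative sketch, not the authors' implementation: the codebook size (256), code dimension (64), and commitment cost (0.25) are placeholder assumptions, and the encoder and decoder networks that would surround this layer in a complete VQ-VAE are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Discretizes continuous encoder frames against a learned codebook."""

    def __init__(self, num_codes=256, code_dim=64, commitment_cost=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.commitment_cost = commitment_cost

    def forward(self, z_e):
        # z_e: (batch, frames, code_dim) continuous encoder outputs
        flat = z_e.reshape(-1, z_e.size(-1))
        # Squared Euclidean distance from each frame to every codebook entry
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2.0 * flat @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(1))
        flat_ids = dists.argmin(dim=1)                # nearest code per frame
        z_q = self.codebook(flat_ids).view_as(z_e)    # quantized frames
        unit_ids = flat_ids.view(z_e.shape[:-1])      # discrete "speech units"
        # Codebook + commitment losses; .detach() implements the stop-gradient
        vq_loss = (F.mse_loss(z_q, z_e.detach())
                   + self.commitment_cost * F.mse_loss(z_e, z_q.detach()))
        # Straight-through estimator: gradients flow around the argmin
        z_q = z_e + (z_q - z_e).detach()
        return z_q, unit_ids, vq_loss


# Toy usage: quantize a batch of (acoustic or articulatory) encoder frames.
frames = torch.randn(8, 100, 64)          # (batch, frames, feature_dim)
vq = VectorQuantizer()
quantized, unit_ids, vq_loss = vq(frames)
print(quantized.shape, unit_ids.shape, vq_loss.item())
```

In a pipeline like the one described in the abstract, encoders over acoustic frames and over articulatory trajectories would feed such a quantizer (separately or after fusion), and the phonetic relevance of the resulting discrete units would then be assessed with an ABX discriminability test.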
