Paper Title

Bootstrapping meaning through listening: Unsupervised learning of spoken sentence embeddings

Authors

Jian Zhu, Zuoyu Tian, Yadong Liu, Cong Zhang, Chia-wen Lo

Abstract

Inducing semantic representations directly from speech signals is a highly challenging task, but it has many useful applications in speech mining and spoken language understanding. This study tackles the unsupervised learning of semantic representations for spoken utterances. By converting speech signals into hidden units generated through acoustic unit discovery, we propose WavEmbed, a multimodal sequential autoencoder that predicts hidden units from a dense representation of speech. Second, we propose S-HuBERT, which induces meaning through knowledge distillation: a sentence embedding model is first trained on hidden units and then passes its knowledge to a speech encoder through contrastive learning. The best-performing model achieves a moderate correlation (0.5~0.6) with human judgments without relying on any labels or transcriptions. Furthermore, these models can easily be extended to leverage textual transcriptions of speech to learn much better speech embeddings that are strongly correlated with human annotations. Our proposed methods are applicable to the development of purely data-driven systems for speech mining, indexing, and search.
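The contrastive distillation step described in the abstract — aligning each speech encoder (student) embedding with the paired hidden-unit sentence embedding from the teacher — can be sketched as an NT-Xent-style objective over a batch of paired embeddings. This is a minimal illustration, not the paper's exact loss; all function and variable names are assumptions.

```python
import numpy as np

def contrastive_distillation_loss(student, teacher, temperature=0.07):
    """NT-Xent-style loss: each student (speech) embedding should be most
    similar to its own teacher (hidden-unit sentence) embedding in the batch.
    Illustrative sketch; not the exact objective from the paper."""
    # L2-normalize rows so dot products become cosine similarities
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    logits = s @ t.T / temperature  # (batch, batch) similarity matrix
    # Cross-entropy with the matched pairs (diagonal) as targets
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy batch: 4 paired embeddings of dimension 8 (synthetic data)
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8))
aligned_student = teacher + 0.01 * rng.normal(size=(4, 8))  # near-copy of teacher
random_student = rng.normal(size=(4, 8))                    # unrelated embeddings

# A student aligned with its teacher incurs a much lower loss
assert (contrastive_distillation_loss(aligned_student, teacher)
        < contrastive_distillation_loss(random_student, teacher))
```

Minimizing this loss pulls each speech embedding toward its paired teacher embedding while pushing it away from the other sentences in the batch, which is the standard mechanism by which contrastive learning transfers the teacher's semantic space to the student.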
