Paper Title

STEPs-RL: Speech-Text Entanglement for Phonetically Sound Representation Learning

Paper Authors

Mishra, Prakamya

Paper Abstract

In this paper, we present a novel multi-modal deep neural network architecture that uses speech and text entanglement for learning phonetically sound spoken-word representations. STEPs-RL is trained in a supervised manner to predict the phonetic sequence of a target spoken word from the speech and text of its contextual spoken words, so that the model encodes meaningful latent representations. Unlike existing work, we use text alongside speech for auditory representation learning, capturing semantic and syntactic information in addition to acoustic and temporal information. The latent representations produced by our model were not only able to predict the target phonetic sequences with an accuracy of 89.47%, but also achieved results competitive with the textual word-representation models Word2Vec and FastText (trained on the textual transcripts) when evaluated on four widely used word-similarity benchmark datasets. In addition, an investigation of the generated vector space demonstrated the capability of the proposed model to capture the phonetic structure of spoken words. To the best of our knowledge, no existing work uses speech and text entanglement for learning spoken-word representations, which makes this work the first of its kind.
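The abstract describes the training objective only at a high level. Below is a minimal, illustrative sketch of one plausible reading of that objective: encode the speech of the contextual spoken words, embed their transcripts, fuse ("entangle") the two modalities into a shared latent vector, and decode the target word's phone sequence from it. This is not the authors' implementation; all module names, dimensions, and the fusion-by-concatenation choice are assumptions made here for illustration.

```python
# Illustrative sketch of a speech-text entangled encoder that predicts a
# target word's phone sequence. NOT the STEPs-RL architecture; all
# hyperparameters and design choices below are assumptions.
import torch
import torch.nn as nn

class SpeechTextEntangledEncoder(nn.Module):
    def __init__(self, n_mels=40, vocab_size=10000, n_phones=50,
                 hidden=128, emb=64):
        super().__init__()
        # Acoustic branch: encode per-frame speech features of context words.
        self.speech_enc = nn.LSTM(n_mels, hidden, batch_first=True)
        # Textual branch: embed the context words' transcript tokens.
        self.text_emb = nn.Embedding(vocab_size, emb)
        # "Entangle" the modalities by projecting their concatenation into
        # a shared latent space (the fusion scheme is an assumption).
        self.fuse = nn.Linear(hidden + emb, hidden)
        # Decode the fused latent vector into a phone sequence.
        self.dec = nn.LSTM(hidden, hidden, batch_first=True)
        self.phone_out = nn.Linear(hidden, n_phones)

    def forward(self, speech_frames, context_word_ids, target_len):
        # speech_frames: (batch, frames, n_mels) for the context utterance
        # context_word_ids: (batch,) transcript word ids of the context
        _, (h, _) = self.speech_enc(speech_frames)        # h: (1, B, hidden)
        acoustic = h[-1]                                  # (B, hidden)
        textual = self.text_emb(context_word_ids)         # (B, emb)
        latent = torch.tanh(self.fuse(
            torch.cat([acoustic, textual], dim=-1)))      # (B, hidden)
        # Feed the fused latent vector at every decoding step.
        dec_in = latent.unsqueeze(1).repeat(1, target_len, 1)
        out, _ = self.dec(dec_in)
        return self.phone_out(out)                        # (B, T, n_phones)

# Toy usage: two context utterances, 100 frames each, 8 target phones.
model = SpeechTextEntangledEncoder()
speech = torch.randn(2, 100, 40)
words = torch.randint(0, 10000, (2,))
logits = model(speech, words, target_len=8)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 50),
                             torch.randint(0, 50, (2 * 8,)))
print(logits.shape, loss.item())
```

Supervised training against ground-truth phone labels, as in the sketch's cross-entropy loss, is what lets the latent vector double as the spoken-word representation evaluated in the paper.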
