Paper Title

Multi-view Temporal Alignment for Non-parallel Articulatory-to-Acoustic Speech Synthesis

Authors

Gonzalez-Lopez, Jose A., Gonzalez-Atienza, Miriam, Gomez-Alanis, Alejandro, Perez-Cordoba, Jose L., Green, Phil D.

Abstract

Articulatory-to-acoustic (A2A) synthesis refers to the generation of audible speech from captured movement of the speech articulators. This technique has numerous applications, such as restoring oral communication to people who can no longer speak due to illness or injury. Most successful techniques so far adopt a supervised learning framework, in which time-synchronous articulatory and speech recordings are used to train a supervised machine learning algorithm that can later be used to map articulator movements to speech. This, however, prevents the application of A2A techniques in cases where parallel data is unavailable, e.g., when a person has already lost her/his voice and only articulatory data can be captured. In this work, we propose a solution to this problem based on the theory of multi-view learning. The proposed algorithm attempts to find an optimal temporal alignment between pairs of non-aligned articulatory and acoustic sequences with the same phonetic content by projecting them into a common latent space where both views are maximally correlated and then applying dynamic time warping. Several variants of this idea are discussed and explored. We show that the quality of speech generated in the non-aligned scenario is comparable to that obtained in the parallel scenario.
