Paper Title

The effectiveness of unsupervised subword modeling with autoregressive and cross-lingual phone-aware networks

Paper Authors

Siyuan Feng, Odette Scharenborg

Paper Abstract

This study addresses unsupervised subword modeling, i.e., learning acoustic feature representations that can distinguish between the subword units of a language. We propose a two-stage learning framework that combines self-supervised learning and cross-lingual knowledge transfer. The framework consists of autoregressive predictive coding (APC) as the front-end and a cross-lingual deep neural network (DNN) as the back-end. Experiments on the ABX subword discriminability task conducted with the Libri-light and ZeroSpeech 2017 databases showed that our approach is competitive with or superior to state-of-the-art studies. Comprehensive and systematic analyses at the phoneme and articulatory feature (AF) level showed that our approach was better at capturing diphthong information than monophthong vowel information, while differences in the amount of information captured for different types of consonants were also observed. Moreover, a positive correlation was found between the effectiveness of the back-end in capturing a phoneme's information and the quality of the cross-lingual phone labels assigned to that phoneme. The AF-level analysis, together with t-SNE visualization results, showed that the proposed approach is better than MFCC and APC features at capturing manner and place of articulation information, as well as vowel height and backness information. Taken together, the analyses showed that both stages of our approach are effective in capturing phoneme and AF information. Nevertheless, monophthong vowel information is less well captured than consonant information, which suggests that future research should focus on improving the capture of monophthong vowel information.
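To make the two-stage framework described above more concrete, the following is a minimal sketch in PyTorch (an assumption; the paper does not specify a toolkit here): an APC front-end that learns frame representations by predicting acoustic frames a few steps ahead, and a cross-lingual DNN back-end that classifies hypothetical phone labels from a resource-rich language, with a bottleneck layer whose activations would serve as the subword-discriminative features. All layer sizes, the prediction shift, and the loss choices are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of the two-stage idea: APC front-end + cross-lingual DNN back-end.
# Hyperparameters (hidden sizes, shift, number of phone classes) are assumptions.
import torch
import torch.nn as nn


class APCFrontEnd(nn.Module):
    """RNN that predicts the acoustic frame `shift` steps ahead (APC objective)."""

    def __init__(self, feat_dim: int = 39, hidden_dim: int = 512, shift: int = 3):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)
        self.shift = shift

    def forward(self, x: torch.Tensor):
        # x: (batch, time, feat_dim), e.g. MFCC frames
        hidden, _ = self.rnn(x)
        pred = self.proj(hidden)
        # Self-supervised loss: predict the frame `shift` steps into the future.
        # The hidden states are the learned representations passed to the back-end.
        loss = nn.functional.l1_loss(pred[:, : -self.shift], x[:, self.shift :])
        return loss, hidden


class CrossLingualBackEnd(nn.Module):
    """DNN classifier over (hypothetical) cross-lingual phone labels,
    with a bottleneck layer used as the subword-discriminative feature."""

    def __init__(self, in_dim: int = 512, num_phones: int = 120, bottleneck: int = 40):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, bottleneck),  # bottleneck features, e.g. for ABX evaluation
        )
        self.classifier = nn.Linear(bottleneck, num_phones)

    def forward(self, feats: torch.Tensor):
        bn = self.encoder(feats)
        return self.classifier(bn), bn


if __name__ == "__main__":
    mfcc = torch.randn(2, 200, 39)                 # dummy utterances (batch, time, dim)
    apc = APCFrontEnd()
    apc_loss, apc_feats = apc(mfcc)                # stage 1: self-supervised APC
    backend = CrossLingualBackEnd()
    logits, bottleneck_feats = backend(apc_feats)  # stage 2: cross-lingual DNN
    print(apc_loss.item(), logits.shape, bottleneck_feats.shape)
```

In this sketch, stage 1 would be trained on unlabeled target-language speech with the APC loss, and stage 2 on APC features paired with phone labels produced by a recognizer from another language; the bottleneck activations are then the features evaluated for subword discriminability.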
