Probing Acoustic Representations for Phonetic Properties

Paper title

Probing Acoustic Representations for Phonetic Properties

Authors

Danni Ma, Neville Ryant, Mark Liberman

Abstract

Pre-trained acoustic representations such as wav2vec and DeCoAR have attained impressive word error rates (WER) for speech recognition benchmarks, particularly when labeled data is limited. But little is known about what phonetic properties these various representations acquire, and how well they encode transferable features of speech. We compare features from two conventional and four pre-trained systems in some simple frame-level phonetic classification tasks, with classifiers trained on features from one version of the TIMIT dataset and tested on features from another. All contextualized representations offered some level of transferability across domains, and models pre-trained on more audio data give better results; but overall, DeCoAR, the system with the simplest architecture, performs best. This type of benchmarking analysis can thus uncover relative strengths of various proposed acoustic representations.
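
The cross-domain probing setup described in the abstract (fit a simple frame-level classifier on features from one version of the data, then score it on features from another) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the features are synthetic Gaussian blobs rather than real wav2vec or DeCoAR outputs, the three classes and the `shift` parameter are hypothetical stand-ins for phonetic categories and domain mismatch, and the nearest-centroid probe is a substitute chosen for brevity, not the classifier used in the paper.

```python
import random

def make_frames(n_per_class, centers, shift, noise, rng):
    """Synthetic stand-in for frame-level features: one Gaussian blob per class.

    `shift` moves every feature by a constant to mimic a change of recording
    conditions (a hypothetical proxy for a second version of the dataset).
    """
    feats, labels = [], []
    for label, center in enumerate(centers):
        for _ in range(n_per_class):
            feats.append([c + shift + rng.gauss(0.0, noise) for c in center])
            labels.append(label)
    return feats, labels

def fit_centroids(feats, labels):
    """Nearest-centroid probe: store the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, y in zip(feats, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
    )

def accuracy(centroids, feats, labels):
    """Fraction of frames assigned to their true class."""
    correct = sum(predict(centroids, x) == y for x, y in zip(feats, labels))
    return correct / len(labels)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
centers = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]  # three hypothetical classes

# Train on the "source" condition, then test in-domain and cross-domain.
train_x, train_y = make_frames(200, centers, shift=0.0, noise=1.0, rng=rng)
in_x, in_y = make_frames(200, centers, shift=0.0, noise=1.0, rng=rng)
out_x, out_y = make_frames(200, centers, shift=1.5, noise=1.0, rng=rng)

probe = fit_centroids(train_x, train_y)
in_acc = accuracy(probe, in_x, in_y)
cross_acc = accuracy(probe, out_x, out_y)
print(f"in-domain accuracy:    {in_acc:.3f}")
print(f"cross-domain accuracy: {cross_acc:.3f}")
```

The gap between the two printed accuracies is the quantity of interest in this kind of analysis: a representation that transfers well keeps the cross-domain score close to the in-domain one. With real data, `make_frames` would be replaced by features extracted from each system under comparison.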
