Paper title
Quantifying Language Variation Acoustically with Few Resources
Authors
Abstract
Deep acoustic models represent linguistic information based on massive amounts of data. Unfortunately, for regional languages and dialects such resources are mostly not available. However, deep acoustic models might have learned linguistic information that transfers to low-resource languages. In this study, we evaluate whether this is the case through the task of distinguishing low-resource (Dutch) regional varieties. By extracting embeddings from the hidden layers of various wav2vec 2.0 models (including new models which are pre-trained and/or fine-tuned on Dutch) and using dynamic time warping, we compute pairwise pronunciation differences averaged over 10 words for over 100 individual dialects from four (regional) languages. We then cluster the resulting difference matrix in four groups and compare these to a gold standard, and a partitioning on the basis of comparing phonetic transcriptions. Our results show that acoustic models outperform the (traditional) transcription-based approach without requiring phonetic transcriptions, with the best performance achieved by the multilingual XLSR-53 model fine-tuned on Dutch. On the basis of only six seconds of speech, the resulting clustering closely matches the gold standard.
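The distance computation described in the abstract can be illustrated with a minimal sketch. It assumes each word recording has already been converted to a sequence of frame-level embedding vectors (for example, taken from a wav2vec 2.0 hidden layer); that extraction step is not shown. The function names, the Euclidean frame distance, and the normalisation of the alignment cost by the summed sequence lengths are assumptions made for illustration, not details confirmed by the abstract.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two frame vectors (an assumed choice of metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Dynamic time warping cost between two sequences of frame vectors.

    The accumulated cost is divided by len(seq_a) + len(seq_b) so that longer
    words do not dominate; this normalisation is an assumption.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def dialect_difference(words_a, words_b):
    """Average DTW distance over the words recorded for both dialects.

    Each argument maps a word label to its sequence of frame vectors.
    """
    shared = sorted(set(words_a) & set(words_b))
    if not shared:
        raise ValueError("the two dialects share no words")
    return sum(dtw_distance(words_a[w], words_b[w]) for w in shared) / len(shared)


def difference_matrix(dialects):
    """Return (sorted dialect names, symmetric matrix of pairwise differences)."""
    names = sorted(dialects)
    matrix = [
        [0.0 if a == b else dialect_difference(dialects[a], dialects[b]) for b in names]
        for a in names
    ]
    return names, matrix


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real wav2vec 2.0 frames have far more dimensions.
    toy = {
        "dialect_A": {"word1": [[0.0, 0.0], [1.0, 1.0]]},
        "dialect_B": {"word1": [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]},
        "dialect_C": {"word1": [[3.0, 4.0], [4.0, 5.0]]},
    }
    names, matrix = difference_matrix(toy)
    for name, row in zip(names, matrix):
        print(name, [round(value, 3) for value in row])
```

The resulting matrix would then be passed to a clustering step (four groups in the study) and compared against the gold standard; those steps are left out of this sketch.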