Paper Title

Investigating data partitioning strategies for crosslinguistic low-resource ASR evaluation

Paper Authors

Zoey Liu, Justin Spence, Emily Prud'hommeaux

Paper Abstract

Many automatic speech recognition (ASR) data sets include a single pre-defined test set consisting of one or more speakers whose speech never appears in the training set. This "hold-speaker(s)-out" data partitioning strategy, however, may not be ideal for data sets in which the number of speakers is very small. This study investigates ten different data split methods for five languages with minimal ASR training resources. We find that (1) model performance varies greatly depending on which speaker is selected for testing; (2) the average word error rate (WER) across all held-out speakers is comparable not only to the average WER over multiple random splits but also to any given individual random split; (3) WER is also generally comparable when the data is split heuristically or adversarially; (4) utterance duration and intensity are comparatively more predictive factors of variability regardless of the data split. These results suggest that the widely used hold-speakers-out approach to ASR data partitioning can yield results that do not reflect model performance on unseen data or speakers. Random splits can yield more reliable and generalizable estimates when facing data sparsity.
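The contrast at the heart of the abstract is between speaker-disjoint ("hold-speaker(s)-out") splits and random utterance-level splits. As a rough illustration only, not the authors' code, the sketch below uses scikit-learn's GroupShuffleSplit and ShuffleSplit on a hypothetical speaker-labeled utterance list; averaging WER over the resulting splits corresponds to the comparison described in findings (1) and (2).

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, ShuffleSplit

# Hypothetical utterance-level metadata: each utterance carries a speaker ID.
rng = np.random.default_rng(0)
speakers = np.array([f"spk{i}" for i in rng.integers(0, 5, size=200)])
utterances = np.arange(len(speakers))

# Hold-speaker(s)-out: every utterance from the test speaker(s) is excluded
# from training (GroupShuffleSplit keeps speaker groups intact).
gss = GroupShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
for train_idx, test_idx in gss.split(utterances, groups=speakers):
    held_out = set(speakers[test_idx])
    assert held_out.isdisjoint(set(speakers[train_idx]))
    # train an ASR model here and record WER for the held-out speaker(s)

# Random split: utterances are shuffled regardless of speaker, so the same
# speaker may appear in both the training and test partitions.
ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
for train_idx, test_idx in ss.split(utterances):
    pass  # train per split and average WER across the random splits
```

The split objects and dataset here are placeholders; in practice each split would feed a full ASR training run, and the per-split WERs would be averaged and compared across strategies.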
