Paper Title
Analyzing the factors affecting usefulness of Self-Supervised Pre-trained Representations for Speech Recognition
Paper Authors
Paper Abstract
Self-supervised learning (SSL) of high-level speech representations has been a popular approach to building Automatic Speech Recognition (ASR) systems in low-resource settings. However, the common assumption made in the literature is that a considerable amount of unlabeled data is available in the same domain or language and can be leveraged for SSL pre-training; we acknowledge that this is often not feasible in real-world settings. In this paper, as part of the Interspeech Gram Vaani ASR challenge, we study the effect of domain, language, dataset size, and other aspects of the upstream pre-training SSL data on the final performance of the low-resource downstream ASR task. We also build on the continued pre-training paradigm to study the effect of the prior knowledge possessed by models trained using SSL. Extensive experiments and studies reveal that the performance of ASR systems is sensitive to the data used for SSL pre-training: performance improves with increasing similarity and volume of the pre-training data. We believe our work will help the speech community build better ASR systems in low-resource settings and steer research towards improving generalization in SSL-based pre-training for speech systems.
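The abstract references the continued pre-training paradigm without spelling out a recipe. As a minimal sketch of what this can look like in practice, the snippet below warm-starts a wav2vec 2.0 model from a public checkpoint and runs its contrastive pre-training objective on new unlabeled audio, using the HuggingFace transformers API. The checkpoint name (facebook/wav2vec2-base), the learning rate, and the `target_domain_audio` iterable are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch only: continued wav2vec 2.0 pre-training with the
# HuggingFace `transformers` Wav2Vec2ForPreTraining API. The checkpoint name
# and `target_domain_audio` are placeholder assumptions, not the paper's setup.
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

# Warm start from an existing SSL checkpoint instead of random initialization;
# this is what "continued pre-training" refers to.
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-base")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def continued_pretraining_step(waveform):
    """One contrastive pre-training step on a single unlabeled 16 kHz utterance."""
    input_values = feature_extractor(
        waveform, sampling_rate=16_000, return_tensors="pt"
    ).input_values
    batch_size, raw_len = input_values.shape
    # Length of the latent sequence produced by the convolutional encoder.
    seq_len = model._get_feat_extract_output_lengths(raw_len).item()

    # Mask spans of latent features and sample distractors for the contrastive loss.
    mask_time_indices = _compute_mask_indices(
        shape=(batch_size, seq_len),
        mask_prob=model.config.mask_time_prob,
        mask_length=model.config.mask_time_length,
    )
    sampled_negative_indices = _sample_negative_indices(
        features_shape=(batch_size, seq_len),
        num_negatives=model.config.num_negatives,
        mask_time_indices=mask_time_indices,
    )
    mask_time_indices = torch.tensor(mask_time_indices, dtype=torch.long)
    sampled_negative_indices = torch.tensor(sampled_negative_indices, dtype=torch.long)

    # The model returns the contrastive (plus diversity) loss when both masks
    # and negatives are supplied.
    loss = model(
        input_values,
        mask_time_indices=mask_time_indices,
        sampled_negative_indices=sampled_negative_indices,
    ).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Usage (hypothetical): iterate over unlabeled target-domain audio.
# for waveform in target_domain_audio:
#     continued_pretraining_step(waveform)
```

The key design point, per the abstract's framing, is that the model's prior knowledge comes entirely from the upstream checkpoint; only the unlabeled audio fed to `continued_pretraining_step` changes across domain, language, and dataset-size conditions.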