Paper Title
Resource-Efficient Transfer Learning From Speech Foundation Model Using Hierarchical Feature Fusion
Paper Authors
Paper Abstract
Self-supervised pre-training of a speech foundation model, followed by supervised fine-tuning, has shown impressive quality improvements on automatic speech recognition (ASR) tasks. Fine-tuning a separate foundation model for each of many downstream tasks is expensive, since the foundation model is usually very large. Parameter-efficient fine-tuning methods (e.g., adapters, sparse update methods) offer an alternative paradigm in which a small set of parameters is updated to adapt the foundation model to new tasks. However, these methods still suffer from high computational memory cost and slow training speed, because they require backpropagation through the entire neural network at each step. In this paper, we analyze the performance of features from different layers of a foundation model on the speech recognition task and propose a novel hierarchical feature fusion method for resource-efficient transfer learning from speech foundation models. Experimental results show that the proposed method achieves better performance on the speech recognition task than existing algorithms, with fewer trainable parameters, lower computational memory cost, and faster training speed. When combined with adapters at all layers, the proposed method matches the performance of fine-tuning the whole model with $97\%$ fewer trainable encoder parameters and $53\%$ faster training speed.
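The abstract does not specify the fusion mechanism, but the key efficiency argument is that fusing features taken from frozen encoder layers lets gradients stop at a small fusion head instead of flowing through the whole network. A minimal sketch of one common layer-fusion scheme (a learned softmax-weighted sum over per-layer outputs; the function names and shapes here are illustrative assumptions, not the paper's actual architecture):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse_layer_features(layer_features, fusion_logits):
    """Combine frozen per-layer features with learned softmax weights.

    layer_features: list of L arrays, each (time, dim) — outputs of the
                    frozen foundation-model layers.
    fusion_logits:  (L,) trainable logits; only these (plus a small task
                    head) receive gradients, so backpropagation never has
                    to traverse the frozen encoder.
    """
    weights = softmax(fusion_logits)
    stacked = np.stack(layer_features, axis=0)     # (L, time, dim)
    return np.tensordot(weights, stacked, axes=1)  # (time, dim)

# Toy example: 3 layers, 5 frames, 4-dim features.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((5, 4)) for _ in range(3)]
fused = fuse_layer_features(feats, np.zeros(3))  # zero logits -> uniform mix
print(fused.shape)  # (5, 4)
```

With zero-initialized logits the fusion starts as a plain average of the layers and can then learn to emphasize whichever layers carry the most task-relevant features.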
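The adapters referenced above are presumably standard bottleneck adapters: small residual modules inserted into a frozen network so that only the bottleneck weights are trained. A minimal sketch under that assumption (class name, dimensions, and initialization are illustrative, not taken from the paper):

```python
import numpy as np

class Adapter:
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus a
    residual connection. Only these two small matrices are trained while
    the surrounding foundation-model layer stays frozen."""

    def __init__(self, dim, bottleneck, rng):
        self.w_down = rng.standard_normal((dim, bottleneck)) * 0.01
        # Zero-init the up-projection so the adapter starts as an
        # identity map and does not perturb the pre-trained features.
        self.w_up = np.zeros((bottleneck, dim))

    def __call__(self, x):
        h = np.maximum(x @ self.w_down, 0.0)  # ReLU bottleneck
        return x + h @ self.w_up              # residual connection

rng = np.random.default_rng(0)
adapter = Adapter(dim=8, bottleneck=2, rng=rng)
x = rng.standard_normal((3, 8))
out = adapter(x)
print(np.allclose(out, x))  # True: zero-init up-projection keeps identity
```

The trainable-parameter savings come from the bottleneck: with `dim=8` and `bottleneck=2`, the adapter holds only 32 weights per insertion point, versus the full layer's weight matrices.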