Paper Title

Avoid Overthinking in Self-Supervised Models for Speech Recognition

Paper Authors

Dan Berrebbi, Brian Yan, Shinji Watanabe

Paper Abstract

Self-supervised learning (SSL) models have reshaped our approach to speech, language, and vision. However, their huge size and the opaque relations between their layers and tasks result in slow inference and network overthinking, where predictions made from the last layer of large models are worse than those made from intermediate layers. Early exit (EE) strategies can address both issues by dynamically reducing computation at inference time for certain samples. Although popular for classification tasks in vision and language, EE has seen less use for sequence-to-sequence speech recognition (ASR) tasks, where outputs from early layers are often degenerate. This challenge is further compounded when speech SSL models are applied to out-of-distribution (OOD) data. This paper first shows that SSL models do overthink in ASR. We then motivate further research in EE by computing an optimal bound for performance versus speed trade-offs. To approach this bound, we propose two new strategies for ASR: (1) we adapt the recently proposed patience strategy to ASR; and (2) we design a new EE strategy specific to ASR that performs better than all strategies previously introduced.
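For illustration only, the sketch below shows one way a patience-style early-exit rule could sit on top of intermediate CTC heads, in the spirit of the strategies the abstract describes: decode each layer's output and stop once the hypothesis has been unchanged for a fixed number of consecutive layers. The greedy decoder, layer count, and `patience` threshold are assumptions for this toy example, not the authors' implementation.

```python
import numpy as np

def ctc_greedy_decode(log_probs, blank_id=0):
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    best = log_probs.argmax(axis=-1)
    hyp, prev = [], None
    for t in best:
        if t != prev and t != blank_id:
            hyp.append(int(t))
        prev = t
    return tuple(hyp)

def patience_early_exit(layer_log_probs, patience=2, blank_id=0):
    """Exit once the hypothesis from intermediate CTC heads is unchanged
    for `patience` consecutive layers; otherwise fall back to the last layer."""
    prev_hyp, streak = None, 0
    for depth, log_probs in enumerate(layer_log_probs, start=1):
        hyp = ctc_greedy_decode(log_probs, blank_id)
        streak = streak + 1 if hyp == prev_hyp else 1
        prev_hyp = hyp
        if streak >= patience:
            return hyp, depth  # early exit: remaining layers are skipped
    return prev_hyp, len(layer_log_probs)

# Toy usage: 12 "layers" of frame-level scores (T=50 frames, V=32 tokens).
rng = np.random.default_rng(0)
fake_outputs = [rng.standard_normal((50, 32)) for _ in range(12)]
hyp, exit_layer = patience_early_exit(fake_outputs, patience=2)
print(f"exited at layer {exit_layer} with {len(hyp)} tokens")
```

With real intermediate outputs, successive layers of a converged model tend to agree on easy utterances, so the rule exits early for them and spends the full depth only on harder or OOD inputs.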
