Paper Title
Insights on Neural Representations for End-to-End Speech Recognition
Paper Authors
Paper Abstract
End-to-end automatic speech recognition (ASR) models aim to learn a generalised speech representation. However, there are limited tools available to understand the internal functions and the effect of hierarchical dependencies within the model architecture. It is crucial to understand the correlations between the layer-wise representations in order to derive insights into the relationship between neural representations and performance. Correlation analysis techniques for investigating network similarity have not previously been explored for end-to-end ASR models. This paper analyses the internal dynamics between layers during training of CNN-, LSTM- and Transformer-based approaches, using canonical correlation analysis (CCA) and centered kernel alignment (CKA) for the experiments. It was found that neural representations within CNN layers exhibit hierarchical correlation dependencies as layer depth increases, but this is mostly limited to cases where the neural representations correlate more closely. This behaviour is not observed in the LSTM architecture; however, a bottom-up pattern is observed across the training process, while Transformer encoder layers exhibit irregular correlation coefficients as neural depth increases. Altogether, these results provide new insights into the role that neural architectures have in speech recognition performance. More specifically, these techniques can be used as indicators to build better-performing speech recognition models.
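The abstract names CCA and CKA as the measures used to compare layer-wise representations but gives no implementation detail. The following is a minimal sketch, assuming plain linear CKA (in the sense of Kornblith et al., 2019) and mean CCA rather than whatever exact variants (e.g. SVCCA or projection-weighted CCA) the authors may have used; the activation matrices `acts_layer3` and `acts_layer6` are hypothetical placeholders for per-layer encoder outputs collected on the same batch of utterances.

```python
import numpy as np

def center(X):
    """Column-centre an (examples x features) activation matrix."""
    return X - X.mean(axis=0, keepdims=True)

def linear_cka(X, Y):
    """Linear CKA between two activation matrices gathered on the same inputs.

    Returns a value in [0, 1]; higher means the two layers' representations
    are more similar (up to orthogonal transforms and isotropic scaling).
    """
    X, Y = center(X), center(Y)
    cross = np.linalg.norm(Y.T @ X, 'fro') ** 2
    return cross / (np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro'))

def mean_cca(X, Y):
    """Mean canonical correlation between two activation matrices.

    Orthonormal bases for the column spaces are obtained with QR; the
    canonical correlations are the singular values of Qx^T Qy.
    """
    Qx, _ = np.linalg.qr(center(X))
    Qy, _ = np.linalg.qr(center(Y))
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False).mean()

# Hypothetical example: frame-level activations from two encoder layers,
# flattened over a batch of utterances (n_frames x hidden_dim).
rng = np.random.default_rng(0)
acts_layer3 = rng.normal(size=(2000, 256))
acts_layer6 = 0.5 * acts_layer3 @ rng.normal(size=(256, 256)) + rng.normal(size=(2000, 256))

print("CKA:", linear_cka(acts_layer3, acts_layer6))
print("CCA:", mean_cca(acts_layer3, acts_layer6))
```

Computing such a score for every pair of layers (or for the same layer at successive training checkpoints) yields the kind of similarity matrices from which the layer-depth and training-dynamics trends described in the abstract can be read off.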