Title
Untangling in Invariant Speech Recognition
Authors
Abstract
Encouraged by the success of deep neural networks on a variety of visual tasks, much theoretical and experimental work has been aimed at understanding and interpreting how vision networks operate. Meanwhile, deep neural networks have also achieved impressive performance in audio processing applications, both as sub-components of larger systems and as complete end-to-end systems by themselves. Despite their empirical successes, comparatively little is understood about how these audio models accomplish these tasks. In this work, we employ a recently developed statistical mechanical theory that connects geometric properties of network representations and the separability of classes to probe how information is untangled within neural networks trained to recognize speech. We observe that speaker-specific nuisance variations are discarded by the network's hierarchy, whereas task-relevant properties such as words and phonemes are untangled in later layers. Higher-level concepts such as parts of speech and context dependence also emerge in the later layers of the network. Finally, we find that the deep representations carry out significant temporal untangling by efficiently extracting task-relevant features at each time step of the computation. Taken together, these findings shed light on how deep auditory models process time-dependent input signals to achieve invariant speech recognition, and show how different concepts emerge through the layers of the network.
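The abstract describes measuring how separable class representations become across layers. A common, simple proxy for such "untangling" is a linear probe: fit a linear classifier on a layer's activations and compare held-out accuracy across layers. The sketch below is a minimal illustration with synthetic data standing in for an early and a deep layer; it is not the paper's statistical-mechanical capacity measure, and all names and the data-generating setup are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def probe_accuracy(features, labels, n_train):
    """Held-out accuracy of a least-squares linear probe,
    used here as a crude proxy for class separability."""
    y = np.where(labels == 1, 1.0, -1.0)
    # Fit w on the training split by least squares: features @ w ~ y.
    w, *_ = np.linalg.lstsq(features[:n_train], y[:n_train], rcond=None)
    # Evaluate sign agreement on the held-out split.
    pred = np.sign(features[n_train:] @ w)
    return float(np.mean(pred == y[n_train:]))

# Synthetic stand-ins for two layers' activations (hypothetical data):
# the "deep" layer carries a stronger class-aligned signal direction.
n, d = 200, 50
labels = rng.integers(0, 2, n)
noise = rng.normal(size=(n, d))
direction = rng.normal(size=d)
signal = np.outer(np.where(labels == 1, 1.0, -1.0), direction)

early_layer = noise + 0.1 * signal  # classes barely separable
deep_layer = noise + 1.0 * signal   # classes well separated

acc_early = probe_accuracy(early_layer, labels, n_train=100)
acc_deep = probe_accuracy(deep_layer, labels, n_train=100)
```

On this toy data the probe recovers the class signal much more reliably from the "deep" layer, mirroring the qualitative finding that task-relevant properties become linearly separable later in the hierarchy.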