Paper Title

Deep Sparse Conformer for Speech Recognition

Paper Authors

Wu, Xianchao

Paper Abstract

Conformer has achieved impressive results in Automatic Speech Recognition (ASR) by combining the Transformer's capture of content-based global interactions with the convolutional neural network's exploitation of local features. In a Conformer block, two macaron-like feed-forward layers with half-step residual connections sandwich the multi-head self-attention and convolution modules, followed by a post layer normalization. We improve Conformer's long-sequence representation ability in two directions: \emph{sparser} and \emph{deeper}. We adapt a sparse self-attention mechanism with $\mathcal{O}(L\log L)$ time complexity and memory usage. A deep normalization strategy is used in the residual connections to ensure stable training of models with on the order of one hundred Conformer blocks. On the Japanese CSJ-500h dataset, this deep sparse Conformer achieves CERs of 5.52\%, 4.03\%, and 4.50\% on the three evaluation sets, respectively, and 4.16\%, 2.84\%, and 3.20\% when ensembling five deep sparse Conformer variants with 12, 16, 17, 50, and 100 encoder layers.
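
The abstract packs in three design points: the macaron block layout, an $\mathcal{O}(L\log L)$ sparse self-attention, and a deep-normalization residual scheme for very deep stacks. As a reading aid, here is a minimal PyTorch sketch of the block layout with DeepNorm-style residual scaling (Wang et al., 2022); every module and parameter name is illustrative, the attention is a dense stand-in for the paper's sparse variant, and details such as the convolution kernel size are assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

class DeepNormResidual(nn.Module):
    """Residual connection x -> LayerNorm(alpha * x + scale * f(x)).

    DeepNorm-style scaling: alpha grows with the total layer count N so
    that very deep post-LN stacks train stably. The abstract only says
    "a deep normalization strategy"; the (2N)**0.25 encoder setting
    below is taken from the DeepNet paper and is an assumption here.
    """
    def __init__(self, dim: int, num_layers: int, scale: float = 1.0):
        super().__init__()
        self.alpha = (2 * num_layers) ** 0.25
        self.scale = scale  # 0.5 for the half-step macaron feed-forwards
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, sublayer_out):
        return self.norm(self.alpha * x + self.scale * sublayer_out)

class ConformerBlock(nn.Module):
    """Macaron layout from the abstract: FFN/2 -> MHSA -> Conv -> FFN/2,
    each sub-module wrapped in a (deep-)normalized residual connection."""

    def __init__(self, dim: int, num_layers: int, heads: int = 4):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(),
                                  nn.Linear(4 * dim, dim))
        # Dense stand-in for the paper's sparse self-attention.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Depthwise 1-D convolution; kernel size 15 is an assumption.
        self.conv = nn.Conv1d(dim, dim, kernel_size=15, padding=7, groups=dim)
        self.ffn2 = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(),
                                  nn.Linear(4 * dim, dim))
        self.res_ffn1 = DeepNormResidual(dim, num_layers, scale=0.5)
        self.res_attn = DeepNormResidual(dim, num_layers)
        self.res_conv = DeepNormResidual(dim, num_layers)
        self.res_ffn2 = DeepNormResidual(dim, num_layers, scale=0.5)

    def forward(self, x):  # x: (batch, time, dim)
        x = self.res_ffn1(x, self.ffn1(x))
        x = self.res_attn(x, self.attn(x, x, x, need_weights=False)[0])
        x = self.res_conv(x, self.conv(x.transpose(1, 2)).transpose(1, 2))
        return self.res_ffn2(x, self.ffn2(x))
```

With num_layers = 100 the residual weight is (2 * 100) ** 0.25 ≈ 3.8, which up-weights the identity path relative to each sub-module's output; that is the basic lever that keeps hundred-block post-LN training stable.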

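The quoted $\mathcal{O}(L\log L)$ complexity is characteristic of query-selection schemes such as Informer's ProbSparse attention (Zhou et al., 2021), in which only roughly $\log L$ "active" queries attend over the keys while the rest fall back to a cheap default. The abstract does not name the exact mechanism, so the sketch below is one plausible instantiation for intuition, not the paper's formulation; for clarity it scores queries on the full logit matrix, whereas a truly sub-quadratic version samples about $L\ln L$ key positions instead.

```python
import math
import torch

def probsparse_attention(q, k, v, factor: int = 5):
    """Simplified ProbSparse-style attention sketch.

    Only the top-u "active" queries (u ~ factor * ln(L)) compute full
    attention; the remaining "lazy" queries return the mean of the
    values. A hypothetical stand-in for the paper's sparse mechanism.
    q, k, v: (batch, length, dim)
    """
    b, L, d = q.shape
    u = min(L, max(1, int(factor * math.log(L))))

    # Sparsity score per query: max minus mean of its attention logits.
    # Computed on the full logits here for clarity; Informer estimates
    # it from ~L*ln(L) sampled keys to stay within O(L log L).
    logits = q @ k.transpose(-2, -1) / math.sqrt(d)       # (b, L, L)
    score = logits.max(dim=-1).values - logits.mean(dim=-1)
    top = score.topk(u, dim=-1).indices                   # (b, u)

    # Default output for lazy queries: the mean of the values.
    out = v.mean(dim=1, keepdim=True).expand(b, L, d).clone()

    # Full attention only for the selected active queries.
    q_top = torch.gather(q, 1, top.unsqueeze(-1).expand(b, u, d))
    attn = torch.softmax(q_top @ k.transpose(-2, -1) / math.sqrt(d), -1)
    out.scatter_(1, top.unsqueeze(-1).expand(b, u, d), attn @ v)
    return out
```

On long utterances this trades a small amount of per-query fidelity for near-linear cost in sequence length, which is what the "sparser" direction of the abstract is after.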