Paper Title

SSCFormer: Push the Limit of Chunk-wise Conformer for Streaming ASR Using Sequentially Sampled Chunks and Chunked Causal Convolution

Authors

Wang, Fangyuan; Xu, Bo; Xu, Bo

Abstract

Currently, chunk-wise schemes are often used to make Automatic Speech Recognition (ASR) models support streaming deployment. However, existing approaches are unable to capture the global context, lack support for parallel training, or exhibit quadratic complexity in the computation of multi-head self-attention (MHSA). On the other hand, causal convolution, which uses no future context, has become the de facto module in the streaming Conformer. In this paper, we propose SSCFormer to push the limit of the chunk-wise Conformer for streaming ASR using the following two techniques: 1) a novel cross-chunk context generation method, named the Sequential Sampling Chunk (SSC) scheme, which re-partitions chunks from regularly partitioned chunks to facilitate efficient long-term contextual interaction within local chunks; 2) the Chunked Causal Convolution (C2Conv), which is designed to concurrently capture the left context and chunk-wise future context. Evaluations on AISHELL-1 show that an End-to-End (E2E) CER of 5.33% can be achieved, which even outperforms a strong time-restricted baseline, U2. Moreover, the chunk-wise MHSA computation in our model enables it to train with a large batch size and perform inference with linear complexity.
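The complexity claim at the end of the abstract can be illustrated with simple counting. The sketch below is not the authors' implementation and does not model the SSC re-partitioning or C2Conv. It only assumes that attention is computed independently inside fixed-size chunks, and the function names and the `chunk_size` parameter are hypothetical, chosen for illustration.

```python
def full_attention_pairs(num_frames: int) -> int:
    """Query-key pairs scored when every frame attends to every frame."""
    return num_frames * num_frames


def chunkwise_attention_pairs(num_frames: int, chunk_size: int) -> int:
    """Query-key pairs scored when attention is restricted to each chunk.

    Assumption: frames are split into consecutive chunks of `chunk_size`
    frames, with one shorter trailing chunk if the length is not a multiple.
    """
    full_chunks, remainder = divmod(num_frames, chunk_size)
    return full_chunks * chunk_size * chunk_size + remainder * remainder


if __name__ == "__main__":
    for num_frames in (1000, 2000, 4000):
        print(
            num_frames,
            full_attention_pairs(num_frames),
            chunkwise_attention_pairs(num_frames, chunk_size=10),
        )
```

With a fixed chunk size, doubling the number of frames doubles the chunk-wise count but quadruples the full-attention count, which is the sense in which chunk-wise MHSA scales linearly with sequence length.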
