Paper Title
Adapting End-to-End Speech Recognition for Readable Subtitles
Paper Authors
Paper Abstract
Automatic speech recognition (ASR) systems are primarily evaluated on transcription accuracy. However, in some use cases such as subtitling, verbatim transcription would reduce output readability given limited screen size and reading time. Therefore, this work focuses on ASR with output compression, a task challenging for supervised approaches due to the scarcity of training data. We first investigate a cascaded system, where an unsupervised compression model is used to post-edit the transcribed speech. We then compare several methods of end-to-end speech recognition under output length constraints. The experiments show that with limited data far less than needed for training a model from scratch, we can adapt a Transformer-based ASR model to incorporate both transcription and compression capabilities. Furthermore, the best performance in terms of WER and ROUGE scores is achieved by explicitly modeling the length constraints within the end-to-end ASR system.
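As a rough illustration of the length-constraint idea mentioned in the abstract (this is not code from the paper; the bucket scheme, dimensions, and names are assumptions), one common way to expose a target length to a Transformer decoder is to prepend a length-bucket embedding to the target token embeddings:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: condition a seq2seq decoder on a desired output length
# by prepending a "length bucket" token, a generic way to model length
# constraints in Transformer-based ASR with compression.

LENGTH_BUCKETS = 10   # bucket 0 = shortest outputs, 9 = longest (assumed)
VOCAB_SIZE = 1000     # assumed subword vocabulary size
D_MODEL = 256         # assumed model dimension

class LengthConditionedDecoderInput(nn.Module):
    """Embeds target subword tokens plus one leading length-bucket token."""

    def __init__(self):
        super().__init__()
        self.token_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.length_emb = nn.Embedding(LENGTH_BUCKETS, D_MODEL)

    def forward(self, target_tokens: torch.Tensor, target_len: torch.Tensor) -> torch.Tensor:
        # target_tokens: (batch, seq) ids of the compressed transcript
        # target_len:    (batch,) desired output length in tokens
        bucket = torch.clamp(target_len // 5, max=LENGTH_BUCKETS - 1)  # crude bucketing
        len_vec = self.length_emb(bucket).unsqueeze(1)                 # (batch, 1, d)
        tok_vec = self.token_emb(target_tokens)                        # (batch, seq, d)
        return torch.cat([len_vec, tok_vec], dim=1)                    # (batch, seq+1, d)

# Usage: feed the result to a standard nn.TransformerDecoder; at training time
# the bucket comes from the reference length, at inference time from the
# subtitle length budget chosen by the caller.
inputs = LengthConditionedDecoderInput()(torch.randint(0, VOCAB_SIZE, (2, 7)),
                                         torch.tensor([12, 30]))
print(inputs.shape)  # torch.Size([2, 8, 256])
```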