Paper Title


VADOI:Voice-Activity-Detection Overlapping Inference For End-to-end Long-form Speech Recognition

Authors

Jinhan Wang, Xiaosu Tong, Jinxi Guo, Di He, Roland Maas

Abstract


While end-to-end models have shown great success on the Automatic Speech Recognition (ASR) task, performance degrades severely when target sentences are long-form. The previously proposed methods, overlapping inference and partial overlapping inference, have been shown to be effective for long-form decoding. For both methods, word error rate (WER) decreases monotonically as the overlapping percentage increases. Setting aside computational cost, the setup with 50% overlapping during inference achieves the best performance; however, a lower overlapping percentage has the advantage of faster inference. In this paper, we first conduct comprehensive experiments comparing overlapping inference and partial overlapping inference under various configurations. We then propose Voice-Activity-Detection Overlapping Inference (VADOI) to provide a trade-off between WER and computation cost. Results show that the proposed method achieves a 20% relative computation-cost reduction on the Librispeech and Microsoft Speech Language Translation long-form corpora while maintaining WER performance compared to the best-performing overlapping inference algorithm. We also propose Soft-Match to compensate for the problem of misaligned similar words.
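The overlapping inference scheme the abstract builds on splits a long utterance into fixed-length, partially overlapping windows that are decoded independently and later merged. A minimal sketch of that chunking step, assuming a generic frame sequence; the window length and the helper name `overlapping_chunks` are illustrative, not from the paper:

```python
def overlapping_chunks(frames, win_len, overlap):
    """Split a long frame sequence into windows of length win_len
    with the given fractional overlap (e.g. 0.5 for the 50% setup
    discussed in the abstract). A lower overlap yields fewer windows,
    hence the faster inference the abstract mentions."""
    assert 0.0 <= overlap < 1.0 and win_len > 0
    step = max(1, int(win_len * (1.0 - overlap)))  # hop between window starts
    chunks = []
    start = 0
    while start < len(frames):
        chunks.append(frames[start:start + win_len])
        if start + win_len >= len(frames):
            break  # last window already covers the tail
        start += step
    return chunks

# 10 frames, 4-frame windows, 50% overlap -> windows start at 0, 2, 4, 6
print(overlapping_chunks(list(range(10)), 4, 0.5))
```

VADOI, as described above, replaces these fixed hop positions with segment boundaries suggested by a voice-activity detector, so cuts fall in pauses rather than mid-word, which is what allows the overlap (and thus cost) to be reduced without hurting WER.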
