Detection of Prosodic Boundaries in Speech Using Wav2Vec 2.0

Paper title

Detection of Prosodic Boundaries in Speech Using Wav2Vec 2.0

Authors

Kunešová, Marie, Řezáčková, Markéta

Abstract

Prosodic boundaries in speech are of great relevance to both speech synthesis and audio annotation. In this paper, we apply the wav2vec 2.0 framework to the task of detecting these boundaries in speech signal, using only acoustic information. We test the approach on a set of recordings of Czech broadcast news, labeled by phonetic experts, and compare it to an existing text-based predictor, which uses the transcripts of the same data. Despite using a relatively small amount of labeled data, the wav2vec2 model achieves an accuracy of 94% and F1 measure of 83% on within-sentence prosodic boundaries (or 95% and 89% on all prosodic boundaries), outperforming the text-based approach. However, by combining the outputs of the two different models we can improve the results even further.
