Paper Title
Applying wav2vec2 for Speech Recognition on Bengali Common Voices Dataset
Authors
Abstract
Speech is inherently continuous, with discrete words, phonemes and other units not clearly segmented, and so speech recognition has been an active research problem for decades. In this work we fine-tuned wav2vec 2.0 to recognize and transcribe Bengali speech, training it on the Bengali Common Voice Speech Dataset. After training for 71 epochs on a training set of 36,919 mp3 files, we achieved a training loss of 0.3172 and a WER of 0.2524 on a validation set of size 7,747. Using a 5-gram language model, the Levenshtein Distance was 2.6446 on a test set of size 7,747. The training and validation sets were then combined, shuffled and split in an 85:15 ratio. Training for 7 more epochs on this combined dataset yielded an improved Levenshtein Distance of 2.60753 on the test set. Our model was the best-performing one, achieving a Levenshtein Distance of 6.234 on a hidden dataset, 1.1049 units lower than the other competing submissions.
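The Levenshtein Distance reported above can be computed with a standard dynamic-programming edit-distance routine. This is a generic sketch of the metric, not the competition's official scorer, and the function name is illustrative:

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `ref` into `hyp`."""
    # `prev` holds the distances from ref[:i-1] to every prefix of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of r
                curr[j - 1] + 1,         # insertion of h
                prev[j - 1] + (r != h),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

On Bengali text the same routine applies per Unicode code point; a per-utterance score averaged over the test set gives a dataset-level figure like those quoted above.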