Paper Title

LSTM and GPT-2 Synthetic Speech Transfer Learning for Speaker Recognition to Overcome Data Scarcity

Authors

Bird, Jordan J., Faria, Diego R., Ekárt, Anikó, Premebida, Cristiano, Ayrosa, Pedro P. S.

Abstract

In speech recognition problems, data scarcity often poses an issue due to the unwillingness of humans to provide large amounts of data for learning and classification. In this work, we take a set of 5 spoken Harvard sentences from 7 subjects and consider their MFCC attributes. Using character-level LSTMs (supervised learning) and OpenAI's attention-based GPT-2 models, synthetic MFCCs are generated by learning from the data provided on a per-subject basis. A neural network is trained to classify the subjects against a large dataset of Flickr8k speakers and is then compared to a transfer-learning network performing the same task, but with an initial weight distribution dictated by learning from the synthetic data generated by the two models. For all 7 subjects, the best results were achieved by networks that had been exposed to synthetic data: the model pre-trained with LSTM-produced data achieved the best result 3 times and the GPT-2 equivalent 5 times (one subject's best result was a draw between the two models). Through these results, we argue that speaker classification can be improved by utilising a small amount of user data combined with exposure to synthetically generated MFCCs, which then allows the networks to achieve near-maximum classification scores.
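The core idea of the abstract — pre-train a classifier on synthetic MFCC frames, then fine-tune it on a small amount of real user data so the transferred weights compensate for data scarcity — can be sketched as below. This is a minimal illustration, not the paper's implementation: random Gaussian clusters stand in for LSTM/GPT-2-generated and real MFCC vectors, and scikit-learn's `MLPClassifier` with `warm_start=True` stands in for the weight-transfer step. All names and parameters here are assumptions for the sketch.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-ins for per-frame MFCC feature vectors (26 coefficients
# is a common choice); in the paper these would come from real recordings and
# from the LSTM / GPT-2 generative models.
def fake_mfcc(center, n=200, dim=26):
    return rng.normal(center, 1.0, size=(n, dim))

# "Synthetic" MFCCs for two speakers, standing in for generated frames.
X_syn = np.vstack([fake_mfcc(0.0), fake_mfcc(2.0)])
y_syn = np.array([0] * 200 + [1] * 200)

# A small amount of "real" user data (the data-scarce condition).
X_real = np.vstack([fake_mfcc(0.0, n=20), fake_mfcc(2.0, n=20)])
y_real = np.array([0] * 20 + [1] * 20)

# Pre-train on synthetic data, then continue training (warm start) on the
# scarce real data -- the weight-transfer idea from the abstract.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300,
                    warm_start=True, random_state=0)
clf.fit(X_syn, y_syn)    # initial weights learned from synthetic MFCCs
clf.fit(X_real, y_real)  # fine-tune on the small real dataset

print(round(clf.score(X_real, y_real), 2))
```

With `warm_start=True`, the second `fit` call resumes from the weights learned on the synthetic set rather than re-initialising, which is the simplest analogue of the paper's transfer-learning setup (initial weight distribution dictated by the synthetic data).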
