Leveraging speaker attribute information using multi task learning for speaker verification and diarization

Paper title

Leveraging speaker attribute information using multi task learning for speaker verification and diarization

Authors

Chau Luu, Peter Bell, Steve Renals

Abstract

Deep speaker embeddings have become the leading method for encoding speaker identity in speaker recognition tasks. The embedding space should ideally capture the variations between all possible speakers, encoding the multiple acoustic aspects that make up a speaker's identity, whilst being robust to non-speaker acoustic variation. Deep speaker embeddings are normally trained discriminatively, predicting speaker identity labels on the training data. We hypothesise that additionally predicting speaker-related auxiliary variables -- such as age and nationality -- may yield representations that are better able to generalise to unseen speakers. We propose a framework for making use of auxiliary label information, even when it is only available for speech corpora mismatched to the target application. On a test set of US Supreme Court recordings, we show that by leveraging two additional forms of speaker attribute information, derived respectively from the matched training data and the VoxCeleb corpus, we improve the performance of our deep speaker embeddings for both verification and diarization tasks, achieving a relative improvement of 26.2% in DER and 6.7% in EER compared to baselines using speaker labels only. This improvement is obtained despite the auxiliary labels having been scraped from the web and being potentially noisy.
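
The multi-task idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `attr_weight` value, and the convention of marking a missing attribute label with `None` are all assumptions made for illustration. It shows one plausible way to add an auxiliary attribute loss (for example, a nationality class) to the usual speaker classification loss, while skipping examples that come from a corpus without attribute labels.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (list of logits, integer label)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[label]


def multitask_loss(speaker_logits, speaker_labels,
                   attr_logits, attr_labels, attr_weight=0.1):
    """Mean speaker loss plus a weighted mean attribute loss.

    attr_labels holds None where no attribute label exists (a hypothetical
    convention), so those examples contribute to the speaker loss only.
    attr_weight is an illustrative value, not one taken from the paper.
    """
    speaker_losses = [cross_entropy(logits, label)
                      for logits, label in zip(speaker_logits, speaker_labels)]
    speaker_loss = sum(speaker_losses) / len(speaker_losses)

    attr_losses = [cross_entropy(logits, label)
                   for logits, label in zip(attr_logits, attr_labels)
                   if label is not None]
    attr_loss = sum(attr_losses) / len(attr_losses) if attr_losses else 0.0

    return speaker_loss + attr_weight * attr_loss
```

In a real system both sets of logits would come from separate classification heads on top of a shared embedding network, so gradients from the attribute loss also shape the speaker embedding.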
