Paper Title

Self-supervised Speaker Diarization

Authors

Dissen, Yehoshua, Kreuk, Felix, Keshet, Joseph

Abstract


Over the last few years, deep learning has grown in popularity for speaker verification, identification, and diarization. Inarguably, a significant part of this success is due to the demonstrated effectiveness of their speaker representations. These, however, are heavily dependent on large amounts of annotated data and can be sensitive to new domains. This study proposes an entirely unsupervised deep-learning model for speaker diarization. Specifically, the study focuses on generating high-quality neural speaker representations without any annotated data, as well as on estimating secondary hyperparameters of the model without annotations. The speaker embeddings are represented by an encoder trained in a self-supervised fashion using pairs of adjacent segments assumed to be of the same speaker. The trained encoder model is then used to self-generate pseudo-labels to subsequently train a similarity score between different segments of the same call using probabilistic linear discriminant analysis (PLDA) and further to learn a clustering stopping threshold. We compared our model to state-of-the-art unsupervised as well as supervised baselines on the CallHome benchmarks. According to empirical results, our approach outperforms unsupervised methods when only two speakers are present in the call, and is only slightly worse than recent supervised models.
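The abstract's core self-supervision signal is the assumption that temporally adjacent speech segments usually come from the same speaker, so such pairs can serve as positive training examples without labels. As a minimal illustrative sketch (not the paper's actual code; the `max_gap` heuristic and the `(start, end)` segment format are assumptions), pairing adjacent segments might look like:

```python
def adjacent_pairs(segments, max_gap=1.0):
    """Pair neighboring segments under the paper's premise that adjacent
    segments are likely from the same speaker.

    `segments` is a list of (start, end) times in seconds, in temporal
    order; returns index pairs to be treated as positive (same-speaker)
    examples when training the encoder.
    """
    pairs = []
    for i in range(len(segments) - 1):
        gap = segments[i + 1][0] - segments[i][1]
        # Only pair segments separated by a short silence; a long pause
        # makes the same-speaker assumption less plausible.
        if gap <= max_gap:
            pairs.append((i, i + 1))
    return pairs

# Toy timeline: four segments, with a long pause before the last one,
# so the final segment is left unpaired.
segs = [(0.0, 1.5), (1.6, 3.0), (3.1, 4.4), (9.0, 10.0)]
print(adjacent_pairs(segs))  # → [(0, 1), (1, 2)]
```

Pairs produced this way would then feed the self-supervised encoder; the trained encoder's pseudo-labels in turn support PLDA scoring and the learned clustering stop threshold described above.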
