Generation of Speaker Representations Using Heterogeneous Training Batch Assembly

Paper title

Generation of Speaker Representations Using Heterogeneous Training Batch Assembly

Authors

Peng, Yu-Huai, Lee, Hung-Shin, Huang, Pin-Tuan, Wang, Hsin-Min

Abstract

In traditional speaker diarization systems, a well-trained speaker model is a key component to extract representations from consecutive and partially overlapping segments in a long speech session. To be more consistent with the back-end segmentation and clustering, we propose a new CNN-based speaker modeling scheme, which takes into account the heterogeneity of the speakers in each training segment and batch. We randomly and synthetically augment the training data into a set of segments, each of which contains more than one speaker and some overlapping parts. A soft label is imposed on each segment based on its speaker occupation ratio, and the standard cross entropy loss is implemented in model training. In this way, the speaker model should have the ability to generate a geometrically meaningful embedding for each multi-speaker segment. Experimental results show that our system is superior to the baseline system using x-vectors in two speaker diarization tasks. In the CALLHOME task trained on the NIST SRE and Switchboard datasets, our system achieves a relative reduction of 12.93% in DER. In Track 2 of CHiME-6, our system provides 13.24%, 12.60%, and 5.65% relative reductions in DER, JER, and WER, respectively.
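The soft-label idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the `(speaker_id, start, end)` span format, and the choice to count overlapped speech toward every active speaker are all assumptions made here for illustration. The sketch computes a soft label from each speaker's share of the speech time in a segment, then evaluates the standard cross-entropy loss against that label.

```python
import math

def occupation_soft_label(spans, num_speakers):
    """Soft label from speaker occupation ratios (hypothetical helper).

    spans: list of (speaker_id, start, end) in seconds within one segment.
    Assumption: overlapped speech counts toward every active speaker.
    """
    durations = [0.0] * num_speakers
    for speaker_id, start, end in spans:
        durations[speaker_id] += end - start
    total = sum(durations)
    return [d / total for d in durations]

def soft_cross_entropy(logits, soft_label):
    """Cross-entropy between a soft label and softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return -sum(q * (z - log_norm) for q, z in zip(soft_label, logits))

# A 2-second segment: speaker 0 talks for 1.5 s, speaker 2 for 1.0 s,
# and the two overlap between 1.0 s and 1.5 s.
spans = [(0, 0.0, 1.5), (2, 1.0, 2.0)]
label = occupation_soft_label(spans, num_speakers=4)
loss = soft_cross_entropy([0.0, 0.0, 0.0, 0.0], label)
```

With a one-hot label this loss reduces to the usual single-speaker cross-entropy, which is why a soft label is a natural extension for multi-speaker segments.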
