Weakly Supervised Training of Hierarchical Attention Networks for Speaker Identification

Paper Title

Weakly Supervised Training of Hierarchical Attention Networks for Speaker Identification

Authors

Yanpei Shi, Qiang Huang, Thomas Hain

Abstract

Identifying multiple speakers without knowing where each speaker's voice occurs in a recording is a challenging task. This paper proposes a hierarchical attention network to solve a weakly labelled speaker identification problem. The hierarchical structure, consisting of a frame-level encoder and a segment-level encoder, aims to learn speaker-related information both locally and globally. Speech streams are first split into fragments. The frame-level encoder with attention learns features, highlights the target-related frames locally, and outputs a fragment-based embedding. The segment-level encoder works with a second attention layer to emphasize the fragments probably related to target speakers. Global information is finally collected from the segment-level module to predict speakers via a classifier. To evaluate the effectiveness of the proposed approach, artificial datasets based on Switchboard Cellular Part 1 (SWBC) and VoxCeleb1 are constructed under two conditions: speakers' voices overlapped and not overlapped. Compared with two baselines, the proposed approach achieves better performance. Further experiments evaluate the impact of utterance segmentation and show that a reasonable segmentation can slightly improve identification performance.
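
The two-level pooling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain dot-product attention with fixed query vectors, fixed-length segments, and no learned encoder layers or classifier. The names `attention_pool`, `hierarchical_embedding`, `frame_query`, and `segment_query` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Weight each vector by softmax(dot(vector, query)) and sum them.

    Assumption: dot-product scoring against a fixed query; the paper's
    actual attention function is not specified in the abstract.
    """
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


def hierarchical_embedding(frames, segment_len, frame_query, segment_query):
    """Pool frames within each segment, then pool the segment embeddings."""
    # Split the frame sequence into consecutive fixed-length segments.
    segments = [frames[i:i + segment_len] for i in range(0, len(frames), segment_len)]
    # Frame-level attention: one embedding per segment.
    segment_embeddings = [attention_pool(seg, frame_query)[0] for seg in segments]
    # Segment-level attention: one embedding for the whole utterance.
    utterance, segment_weights = attention_pool(segment_embeddings, segment_query)
    return utterance, segment_weights


# Toy input: four "silent" frames followed by four "speaker" frames.
frames = [[0.0, 0.0] for _ in range(4)] + [[2.0, 2.0] for _ in range(4)]
utterance, segment_weights = hierarchical_embedding(
    frames, segment_len=4, frame_query=[1.0, 1.0], segment_query=[1.0, 1.0]
)
print(segment_weights)  # the second segment receives the larger weight
```

In a full system, the utterance-level vector would be passed to a classifier that predicts speaker identities, and the queries and encoders would be trained from the weak, recording-level labels.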
