End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors

Paper Title

End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors

Authors

Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Kenji Nagamatsu

Abstract

This paper addresses end-to-end speaker diarization for an unknown number of speakers. Recently proposed end-to-end speaker diarization has outperformed conventional clustering-based speaker diarization, but it has one drawback: it is less flexible in terms of the number of speakers. This paper proposes a method for encoder-decoder based attractor calculation (EDA), which first generates a flexible number of attractors from a speech embedding sequence. The generated attractors are then multiplied by the speech embedding sequence to produce the same number of speaker activities. The speech embedding sequence is extracted using the conventional self-attentive end-to-end neural speaker diarization (SA-EEND) network. In the two-speaker condition, our method achieved a 2.69% diarization error rate (DER) on simulated mixtures and an 8.07% DER on the two-speaker subset of CALLHOME, while vanilla SA-EEND attained 4.56% and 9.54%, respectively. With an unknown number of speakers, our method attained a 15.29% DER on CALLHOME, while the x-vector-based clustering method achieved a 19.43% DER.
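
The step in which attractors are multiplied by the speech embedding sequence can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `speaker_activities`, the array shapes, and the sigmoid applied to the products are assumptions made for illustration, and random arrays stand in for the SA-EEND embeddings and the encoder-decoder attractors, neither of which is modelled here.

```python
import numpy as np


def speaker_activities(embeddings, attractors):
    """Hypothetical helper: per-frame activity probabilities for each speaker.

    embeddings: (T, D) array, one D-dimensional embedding per frame.
    attractors: (S, D) array, one D-dimensional attractor per speaker.
    Returns a (T, S) array of values in (0, 1).
    """
    # Dot product of every frame embedding with every attractor.
    logits = embeddings @ attractors.T
    # Assumption: a sigmoid turns each product into an independent probability,
    # so overlapping speech (several speakers active at once) is representable.
    return 1.0 / (1.0 + np.exp(-logits))


rng = np.random.default_rng(0)
T, D = 5, 4  # 5 frames, 4-dimensional embeddings (toy sizes)
embeddings = rng.standard_normal((T, D))  # stand-in for SA-EEND output

# The number of output speaker tracks follows the number of attractors.
for num_speakers in (2, 3):
    attractors = rng.standard_normal((num_speakers, D))  # stand-in for EDA output
    probs = speaker_activities(embeddings, attractors)
    active = probs > 0.5  # assumed decision threshold
    print(num_speakers, probs.shape, active.shape)
```

Because the output has one column per attractor, changing the number of attractors changes the number of speaker activity tracks without changing the embedding network, which is the flexibility the abstract describes.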
