Paper Title

The 2021 NIST Speaker Recognition Evaluation

Authors

Sadjadi, Seyed Omid; Greenberg, Craig; Singer, Elliot; Mason, Lisa; Reynolds, Douglas

Abstract

The 2021 Speaker Recognition Evaluation (SRE21) was the latest cycle of the ongoing evaluation series conducted by the U.S. National Institute of Standards and Technology (NIST) since 1996. It was the second large-scale multimodal speaker/person recognition evaluation organized by NIST (the first one being SRE19). Similar to SRE19, it featured two core evaluation tracks, namely audio and audio-visual, as well as an optional visual track. In addition to offering fixed and open training conditions, it also introduced new challenges for the community, thanks to a new multimodal (i.e., audio, video, and selfie images) and multilingual (i.e., with multilingual speakers) corpus, termed WeCanTalk, collected outside North America by the Linguistic Data Consortium (LDC). These challenges included: 1) trials (target and non-target) with enrollment and test segments originating from different domains (i.e., telephony versus video), and 2) trials (target and non-target) with enrollment and test segments spoken in different languages (i.e., cross-lingual trials). This paper presents an overview of SRE21, including the tasks, performance metric, data, evaluation protocol, results, and system performance analyses. A total of 23 organizations (forming 15 teams) from academia and industry participated in SRE21 and submitted 158 valid system outputs. Evaluation results indicate that: audio-visual fusion produces substantial gains in performance over audio-only or visual-only systems; top-performing speaker and face recognition systems exhibited comparable performance under the matched domain conditions present in this evaluation; and the use of complex neural network architectures (e.g., ResNet) along with angular margin losses, data augmentation, and long-duration fine-tuning contributed to notable performance improvements for the audio-only speaker recognition task.
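The abstract refers to the evaluation's performance metric; NIST SRE cycles have historically scored systems with a normalized detection cost function (DCF) that weights miss and false-alarm rates. The sketch below illustrates that general computation in Python. The parameter values (`p_target`, unit costs) and the function name are illustrative assumptions here, not necessarily the exact SRE21 cost model, which is defined in the official SRE21 evaluation plan.

```python
import numpy as np

def normalized_dcf(scores, labels, threshold, p_target=0.01,
                   c_miss=1.0, c_fa=1.0):
    """Normalized detection cost at a fixed decision threshold.

    scores: detection scores, higher means more target-like
    labels: True for target trials, False for non-target trials
    NOTE: parameter defaults are illustrative, not the official
    SRE21 values; see the SRE21 evaluation plan for the real ones.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    # miss rate: target trials scored below the threshold
    p_miss = np.mean(scores[labels] < threshold)
    # false-alarm rate: non-target trials scored at/above the threshold
    p_fa = np.mean(scores[~labels] >= threshold)
    dcf = c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa
    # normalize by the best trivial system (always accept or always reject)
    dcf_default = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return dcf / dcf_default

# Example: a perfectly separating threshold gives zero cost
print(normalized_dcf([2.0, 1.0, -1.0, -2.0],
                     [True, True, False, False], threshold=0.0))  # 0.0
```

Because the normalization divides by the cost of the best uninformative system, a normalized DCF of 1.0 or more means the system is no better than always accepting or always rejecting.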
