Paper title
Spot the conversation: speaker diarisation in the wild
Paper authors
Paper abstract
The goal of this paper is speaker diarisation of videos collected 'in the wild'. We make three key contributions. First, we propose an automatic audio-visual diarisation method for YouTube videos. Our method consists of active speaker detection using audio-visual methods and speaker verification using self-enrolled speaker models. Second, we integrate our method into a semi-automatic dataset creation pipeline which significantly reduces the number of hours required to annotate videos with diarisation labels. Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community. Our dataset consists of overlapping speech, a large and diverse speaker pool, and challenging background conditions.
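The abstract's second step, speaker verification using self-enrolled speaker models, can be illustrated with a toy sketch. This is not the authors' pipeline. It assumes each speech segment already has a fixed-length embedding and that an audio-visual front end has labelled some segments with an on-screen speaker. The field names `embedding` and `av_speaker`, the helper functions, and the 0.7 cosine threshold are all hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def enroll(segments):
    """Self-enrolment: one model per speaker, built only from segments
    that the (assumed) audio-visual front end already labelled."""
    by_speaker = {}
    for seg in segments:
        if seg["av_speaker"] is not None:
            by_speaker.setdefault(seg["av_speaker"], []).append(seg["embedding"])
    return {spk: mean_vector(embs) for spk, embs in by_speaker.items()}


def diarise(segments, threshold=0.7):
    """Keep audio-visual labels where present; otherwise assign the
    best-matching enrolled speaker, or None if no score reaches threshold."""
    models = enroll(segments)
    labels = []
    for seg in segments:
        if seg["av_speaker"] is not None:
            labels.append(seg["av_speaker"])
            continue
        best, best_score = None, threshold
        for spk, model in models.items():
            score = cosine(seg["embedding"], model)
            if score >= best_score:
                best, best_score = spk, score
        labels.append(best)
    return labels


# Toy data: three segments with a visible speaker, three without.
segments = [
    {"embedding": [1.0, 0.1], "av_speaker": "A"},
    {"embedding": [0.9, 0.0], "av_speaker": "A"},
    {"embedding": [0.0, 1.0], "av_speaker": "B"},
    {"embedding": [0.95, 0.05], "av_speaker": None},   # off-screen, close to A
    {"embedding": [0.1, 0.9], "av_speaker": None},     # off-screen, close to B
    {"embedding": [-1.0, -1.0], "av_speaker": None},   # matches nobody
]

print(diarise(segments))  # ['A', 'A', 'B', 'A', 'B', None]
```

Segments that match no enrolled model stay unlabelled in this sketch. In a semi-automatic pipeline like the one the abstract describes, such cases would plausibly be left for a human annotator, though the abstract does not specify this.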