Paper Title

The Conversational Short-phrase Speaker Diarization (CSSD) Task: Dataset, Evaluation Metric and Baselines

Authors

Gaofeng Cheng, Yifan Chen, Runyan Yang, Qingxuan Li, Zehui Yang, Lingxuan Ye, Pengyuan Zhang, Qingqing Zhang, Lei Xie, Yanmin Qian, Kong Aik Lee, Yonghong Yan

Abstract

The conversation scenario is one of the most important and most challenging scenarios for speech processing technologies, because people in conversation respond to each other in a casual style. Detecting the speech activities of each person in a conversation is vital to downstream tasks such as natural language processing and machine translation. People refer to the technology for detecting "who spoke when" as speaker diarization (SD). The diarization error rate (DER) has long been used as the standard evaluation metric for SD systems. However, DER fails to give enough importance to short conversational phrases, which are brief but important on the semantic level. Moreover, the speech community still lacks a carefully and accurately manually annotated test dataset suitable for evaluating conversational SD technologies. In this paper, we design and describe the Conversational Short-phrase Speaker Diarization (CSSD) task, which consists of training and testing datasets, an evaluation metric, and baselines. On the dataset side, in addition to the previously open-sourced 180-hour conversational MagicData-RAMC dataset, we prepare a separate 20-hour conversational speech test dataset with carefully and manually verified speaker timestamp annotations for the CSSD task. On the metric side, we design a new conversational DER (CDER) evaluation metric, which calculates SD accuracy at the utterance level. On the baseline side, we adopt a commonly used method, the Variational Bayes HMM x-vector system, as the baseline of the CSSD task. Our evaluation metric is publicly available at https://github.com/SpeechClub/CDER_Metric.
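The abstract only describes CDER at a high level; the exact definition and official implementation live in the linked CDER_Metric repository. Below is a minimal illustrative sketch of what an utterance-level error count can look like, to contrast with time-weighted DER: each reference utterance is scored as a whole, so a short back-channel phrase carries the same weight as a long turn. The Segment class, the utterance_level_error_rate function, and the min_overlap_ratio threshold are hypothetical names introduced here for illustration and are not taken from the paper or from the official code.

```python
# Illustrative sketch only, NOT the official CDER implementation
# (see https://github.com/SpeechClub/CDER_Metric for the real metric).
# Assumes single-speaker reference utterances and hypothesis labels
# already aligned to reference labels (no speaker mapping step).

from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # seconds
    end: float     # seconds
    speaker: str   # speaker label


def utterance_level_error_rate(reference, hypothesis, min_overlap_ratio=0.5):
    """Fraction of reference utterances whose dominant hypothesized speaker
    is wrong or covers less than min_overlap_ratio of the utterance."""
    errors = 0
    for ref in reference:
        # Accumulate overlap duration per hypothesized speaker inside this utterance.
        overlap = {}
        for hyp in hypothesis:
            dur = min(ref.end, hyp.end) - max(ref.start, hyp.start)
            if dur > 0:
                overlap[hyp.speaker] = overlap.get(hyp.speaker, 0.0) + dur
        if not overlap:
            errors += 1  # utterance completely missed by the system
            continue
        best_spk, best_dur = max(overlap.items(), key=lambda kv: kv[1])
        ref_dur = ref.end - ref.start
        # The utterance counts as one error if the dominant speaker differs
        # or covers too little of it, regardless of how short it is.
        if best_spk != ref.speaker or best_dur / ref_dur < min_overlap_ratio:
            errors += 1
    return errors / len(reference) if reference else 0.0


if __name__ == "__main__":
    ref = [Segment(0.0, 1.2, "A"), Segment(1.2, 1.5, "B"), Segment(1.5, 3.0, "A")]
    hyp = [Segment(0.0, 1.5, "A"), Segment(1.5, 3.0, "A")]
    print(utterance_level_error_rate(ref, hyp))  # -> 0.333...
```

In this toy example, missing the 0.3-second back-channel by speaker B contributes a full one-third of the error even though it barely affects a time-weighted DER, which illustrates the motivation for evaluating conversational SD at the utterance level.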
