Paper Title

Closing the Gap between Single-User and Multi-User VoiceFilter-Lite

Authors

Rajeev Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ian McGraw

Abstract

VoiceFilter-Lite is a speaker-conditioned voice separation model that plays a crucial role in improving speech recognition and speaker verification by suppressing overlapping speech from non-target speakers. However, one limitation of VoiceFilter-Lite, and other speaker-conditioned speech models in general, is that these models are usually limited to a single target speaker. This is undesirable as most smart home devices now support multiple enrolled users. In order to extend the benefits of personalization to multiple users, we previously developed an attention-based speaker selection mechanism and applied it to VoiceFilter-Lite. However, the original multi-user VoiceFilter-Lite model suffers from significant performance degradation compared with single-user models. In this paper, we devised a series of experiments to improve the multi-user VoiceFilter-Lite model. By incorporating a dual learning rate schedule and by using feature-wise linear modulation (FiLM) to condition the model with the attended speaker embedding, we successfully closed the performance gap between multi-user and single-user VoiceFilter-Lite models on single-speaker evaluations. At the same time, the new model can also be easily extended to support any number of users, and significantly outperforms our previously published model on multi-speaker evaluations.
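The abstract names two conditioning mechanisms without detailing them: an attention-based selection over the enrolled speaker embeddings, and FiLM conditioning of the model on the attended embedding. The plain-Python sketch below illustrates both under loudly stated assumptions. The query is assumed to be a per-frame speaker embedding of the input audio. The attention is assumed to be scaled dot-product with a softmax. The FiLM scale and shift are assumed to come from single linear projections. All function names and shapes are hypothetical and do not represent the authors' implementation.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]  # row-major: one row per output dimension


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query: Vector, enrolled: List[Vector]) -> Vector:
    """Scaled dot-product attention weights of one query over enrolled embeddings.

    ASSUMPTION: the query is a per-frame speaker embedding of the input audio;
    the exact attention form used in the paper is not given in the abstract.
    """
    d = len(query)
    scores = [sum(q * e for q, e in zip(query, emb)) / math.sqrt(d) for emb in enrolled]
    return softmax(scores)


def attend_speaker(query: Vector, enrolled: List[Vector]) -> Vector:
    """Attended speaker embedding: attention-weighted average of enrolled embeddings."""
    weights = attention_weights(query, enrolled)
    d = len(enrolled[0])
    return [sum(w * emb[i] for w, emb in zip(weights, enrolled)) for i in range(d)]


def linear(weight: Matrix, bias: Vector, x: Vector) -> Vector:
    """y = W x + b for a single (hypothetical) projection layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def film(features: Vector, cond: Vector,
         w_gamma: Matrix, b_gamma: Vector,
         w_beta: Matrix, b_beta: Vector) -> Vector:
    """Feature-wise linear modulation: out = gamma(cond) * features + beta(cond)."""
    gamma = linear(w_gamma, b_gamma, cond)  # per-channel scale from the conditioning vector
    beta = linear(w_beta, b_beta, cond)     # per-channel shift from the conditioning vector
    return [g * f + b for g, f, b in zip(gamma, features, beta)]
```

Consider two hypothetical enrolled embeddings `[[1.0, 0.0], [0.0, 1.0]]` and a frame embedding `[1.0, 0.0]`. The attention weights favor the first speaker, so the attended embedding leans toward `[1.0, 0.0]`. Passing that vector to `film` as `cond` yields a per-channel scale and shift for the separation network's features. This conditioning form also does not depend on how many speakers are enrolled, because the attention step always reduces them to a single embedding.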