CVSS Corpus and Massively Multilingual Speech-to-Speech Translation

Paper Title

CVSS Corpus and Massively Multilingual Speech-to-Speech Translation

Authors

Ye Jia, Michelle Tadmor Ramanovich, Quan Wang, Heiga Zen

Abstract

We introduce CVSS, a massively multilingual-to-English speech-to-speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation (ST) corpus, by synthesizing the translation text from CoVoST 2 into speech using state-of-the-art TTS systems. Two versions of translation speeches are provided: 1) CVSS-C: All the translation speeches are in a single high-quality canonical voice; 2) CVSS-T: The translation speeches are in voices transferred from the corresponding source speeches. In addition, CVSS provides normalized translation text which matches the pronunciation in the translation speech. On each version of CVSS, we built baseline multilingual direct S2ST models and cascade S2ST models, verifying the effectiveness of the corpus. To build strong cascade S2ST baselines, we trained an ST model on CoVoST 2, which outperforms the previous state-of-the-art trained on the corpus without extra data by 5.8 BLEU. Nevertheless, the performance of the direct S2ST models approaches the strong cascade baselines when trained from scratch, and with only 0.1 or 0.7 BLEU difference on ASR transcribed translation when initialized from matching ST models.
