Paper Title
CoVoST 2 and Massively Multilingual Speech-to-Text Translation
Authors
Abstract
Speech translation has recently become an increasingly popular research topic, partly due to the development of benchmark datasets. Nevertheless, current datasets cover only a limited number of languages. To foster research in massively multilingual speech translation and in speech translation for low-resource language pairs, we release CoVoST 2, a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages. It is the largest open dataset available to date in terms of total volume and language coverage. Data sanity checks provide evidence of the quality of the data, which is released under the CC0 license. We also provide extensive speech recognition, bilingual and multilingual machine translation, and speech translation baselines with open-source implementations.