Paper Title


TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation

Paper Authors

Rongjie Huang, Jinglin Liu, Huadai Liu, Yi Ren, Lichao Zhang, Jinzheng He, Zhou Zhao

Paper Abstract


Direct speech-to-speech translation (S2ST) with discrete units leverages recent progress in speech representation learning. Specifically, a sequence of discrete representations derived in a self-supervised manner is predicted by the model and passed to a vocoder for speech reconstruction, while still facing the following challenges: 1) Acoustic multimodality: the discrete units derived from speech with the same content can be nondeterministic due to acoustic properties (e.g., rhythm, pitch, and energy), which degrades translation accuracy; 2) High latency: current S2ST systems use autoregressive models that predict each unit conditioned on the previously generated sequence, failing to take full advantage of parallelism. In this work, we propose TranSpeech, a speech-to-speech translation model with bilateral perturbation. To alleviate the acoustic multimodality problem, we propose bilateral perturbation (BiP), which consists of style normalization and information enhancement stages, to learn only the linguistic information from speech samples and generate more deterministic representations. With reduced multimodality, we step forward and become the first to establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices and produces high-accuracy results in just a few cycles. Experimental results on three language pairs demonstrate that BiP yields an improvement of 2.9 BLEU on average compared with a baseline textless S2ST model. Moreover, our parallel decoding shows a significant reduction of inference latency, enabling a speedup of up to 21.4x over the autoregressive technique. Audio samples are available at \url{https://TranSpeech.github.io/}
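The iterative parallel decoding the abstract describes (repeatedly masking and re-predicting unit choices over a few cycles) follows the general mask-predict recipe. A minimal NumPy sketch of that loop, with a hypothetical one-hot "oracle" predictor standing in for the trained non-autoregressive model:

```python
import numpy as np

def mask_predict_decode(predict_fn, length, iterations=3, mask_id=-1):
    """Toy mask-predict decoding over discrete units: start fully masked,
    fill every masked position in parallel, then re-mask the least
    confident positions on a linear schedule and predict again."""
    units = np.full(length, mask_id, dtype=int)
    confidence = np.zeros(length)
    for t in range(iterations):
        probs = predict_fn(units)             # (length, vocab) distributions
        masked = units == mask_id
        units[masked] = probs.argmax(axis=1)[masked]
        confidence[masked] = probs.max(axis=1)[masked]
        # Linear schedule: re-mask fewer positions each iteration.
        n_remask = length * (iterations - 1 - t) // iterations
        if n_remask > 0:
            worst = np.argsort(confidence)[:n_remask]  # least confident slots
            units[worst] = mask_id
            confidence[worst] = 0.0
    return units

# Hypothetical oracle predictor (not the paper's model): always returns a
# one-hot distribution over a fixed target unit sequence.
TARGET = np.array([1, 2, 3, 0, 4])
oracle = lambda units: np.eye(5)[TARGET]

print(mask_predict_decode(oracle, length=len(TARGET)))  # → [1 2 3 0 4]
```

With a real model, `predict_fn` would condition on the source speech encoding as well as the partially masked target units; the key point is that each cycle fills all masked slots in parallel, which is where the latency reduction over unit-by-unit autoregressive decoding comes from.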
