Paper title
STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation
Paper authors
Paper abstract
How can we learn a better speech representation for end-to-end speech-to-text translation (ST) with limited labeled data? Existing techniques often attempt to transfer powerful machine translation (MT) capabilities to ST, but neglect the representation discrepancy across modalities. In this paper, we propose the Speech-TExt Manifold Mixup (STEMM) method to calibrate this discrepancy. Specifically, we mix up the representation sequences of the two modalities, take both the unimodal speech sequences and the multimodal mixed sequences as parallel inputs to the translation model, and regularize their output predictions with a self-learning framework. Experiments on the MuST-C speech translation benchmark and further analysis show that our method effectively alleviates the cross-modal representation discrepancy and achieves significant improvements over a strong baseline on eight translation directions.
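The core idea of the abstract — mixing speech and text representation sequences, then regularizing the predictions of the unimodal and mixed branches against each other — can be sketched in a few lines. The toy sizes, the random "representations", the single linear layer standing in for the translation model, and the position-wise swap with probability `p` are all illustrative assumptions, not the paper's actual architecture or mixup schedule:

```python
import math
import random

random.seed(0)

T, d, vocab = 6, 4, 5  # toy sizes: sequence length, hidden dim, target vocab

def rand_seq(n, d):
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

speech = rand_seq(T, d)  # hypothetical speech representation sequence
text = rand_seq(T, d)    # hypothetical aligned text embedding sequence

def manifold_mixup(speech, text, p):
    """Replace each speech position with its text counterpart with probability p."""
    return [t if random.random() < p else s for s, t in zip(speech, text)]

W = rand_seq(d, vocab)  # toy linear layer standing in for the translation model

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def predict(seq):
    # Per-position distribution over the target vocabulary.
    return [softmax([sum(h[i] * W[i][j] for i in range(d))
                     for j in range(vocab)]) for h in seq]

def mean_kl(ps, qs, eps=1e-9):
    # Mean token-level KL(p || q): the self-learning regularizer pulls the
    # mixed-branch predictions toward the unimodal-branch predictions.
    per_tok = [sum(p * (math.log(p + eps) - math.log(q + eps))
                   for p, q in zip(pt, qt)) for pt, qt in zip(ps, qs)]
    return sum(per_tok) / len(per_tok)

mixed = manifold_mixup(speech, text, p=0.5)  # multimodal mixed sequence
p_speech = predict(speech)                   # unimodal branch predictions
p_mixed = predict(mixed)                     # multimodal branch predictions
reg_loss = mean_kl(p_mixed, p_speech)        # regularization term (>= 0)
```

In training this regularizer would be added to the usual translation losses of both branches; here it only demonstrates the shape of the computation, under the simplifying assumption that the speech and text sequences are already length-aligned.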