Paper Title

Speech Pre-training with Acoustic Piece

Authors

Shuo Ren, Shujie Liu, Yu Wu, Long Zhou, Furu Wei

Abstract

Previous speech pre-training methods, such as wav2vec2.0 and HuBERT, pre-train a Transformer encoder to learn deep representations from audio data, with objectives predicting either elements from latent vector quantized space or pre-generated labels (known as target codes) with offline clustering. However, those training signals (quantized elements or codes) are independent across different tokens without considering their relations. According to our observation and analysis, the target codes share obvious patterns aligned with phonemized text data. Based on that, we propose to leverage those patterns to better pre-train the model considering the relations among the codes. The patterns we extracted, called "acoustic pieces", are derived from the SentencePiece result of HuBERT codes. With the acoustic piece as the training signal, we can implicitly bridge the input audio and natural language, which benefits audio-to-text tasks, such as automatic speech recognition (ASR). Simple but effective, our method "HuBERT-AP" significantly outperforms strong baselines on the LibriSpeech ASR task.
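The core extraction step the abstract describes, running a SentencePiece-style subword learner over HuBERT code sequences to obtain recurring "acoustic pieces", can be illustrated with a toy BPE loop. This is a minimal pure-Python sketch, not the authors' implementation: the code values, the merge count, and the consecutive-repeat deduplication step are illustrative assumptions.

```python
from collections import Counter

def dedup(codes):
    # Collapse consecutive repeats of the same frame-level code
    # (HuBERT emits one code per frame, so codes often repeat).
    out = []
    for c in codes:
        if not out or out[-1] != c:
            out.append(c)
    return out

def learn_acoustic_pieces(code_seqs, num_merges):
    # BPE-style learning over code sequences: each round merges the
    # most frequent adjacent token pair into one "acoustic piece",
    # represented as a tuple of base codes.
    seqs = [[(c,) for c in dedup(s)] for s in code_seqs]
    for _ in range(num_merges):
        pairs = Counter(p for s in seqs for p in zip(s, s[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged = a + b  # tuple concatenation
        for s in seqs:
            i = 0
            while i < len(s) - 1:
                if s[i] == a and s[i + 1] == b:
                    s[i:i + 2] = [merged]
                else:
                    i += 1
    return seqs

# Toy usage: made-up frame-level HuBERT codes for two utterances.
utts = [[7, 7, 3, 3, 5, 7, 3, 2], [7, 3, 5, 5, 2]]
pieces = learn_acoustic_pieces(utts, num_merges=2)
# Frequent code runs like (7, 3, 5) get merged into single pieces,
# which then serve as the pre-training targets instead of raw codes.
```

In the paper's setting these merged pieces replace the independent per-frame codes as prediction targets, so the model is trained on units whose span correlates with phoneme-like patterns in the audio.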
