Paper Title
A systematic comparison of grapheme-based vs. phoneme-based label units for encoder-decoder-attention models
Paper Authors
Paper Abstract
Following the rationale of end-to-end modeling, CTC, RNN-T, or encoder-decoder-attention models for automatic speech recognition (ASR) use graphemes or grapheme-based subword units derived from, e.g., byte-pair encoding (BPE). The mapping from pronunciation to spelling is learned entirely from data. In contrast, classical approaches to ASR employ secondary knowledge sources in the form of phoneme lists to define phonetic output labels and pronunciation lexica. In this work, we perform a systematic comparison between grapheme- and phoneme-based output labels for an encoder-decoder-attention ASR model. We investigate the use of single phonemes as well as BPE-based phoneme groups as output labels of our model. To preserve a simple and efficient decoder design, we also extend the phoneme set with auxiliary units that make homophones distinguishable. Experiments performed on the Switchboard 300h and LibriSpeech benchmarks show that phoneme-based modeling is competitive with grapheme-based encoder-decoder-attention modeling.
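As a rough illustration of how BPE-based phoneme groups and the auxiliary homophone units described in the abstract could interact, here is a minimal greedy BPE sketch over phoneme sequences. The toy ARPAbet-style lexicon, the `#1`/`#2` disambiguation symbols, and the `_`-joining convention are illustrative assumptions, not the paper's actual label inventory or implementation.

```python
from collections import Counter

def merge_pair(seq, pair):
    """Replace each adjacent occurrence of `pair` with one joined symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + "_" + seq[i + 1])  # join convention is illustrative
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe_merges(corpus, num_merges):
    """Greedily learn BPE merges over symbol sequences (here: phonemes)."""
    corpus = [list(seq) for seq in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))  # count adjacent symbol pairs
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]   # most frequent adjacent pair
        merges.append(best)
        corpus = [merge_pair(seq, best) for seq in corpus]
    return merges

def apply_bpe(seq, merges):
    """Encode a pronunciation by replaying the learned merges in order."""
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq

# Toy lexicon; "#1"/"#2" are auxiliary disambiguation units appended so the
# homophones "right" and "write" map to distinct label sequences.
corpus = [
    ["R", "AY", "T", "#1"],  # right
    ["R", "AY", "T", "#2"],  # write
    ["B", "R", "AY", "N"],   # brine
]
merges = learn_bpe_merges(corpus, 2)
```

With this toy data, `apply_bpe(["R", "AY", "T", "#2"], merges)` yields `["R_AY_T", "#2"]`: both homophones share the phoneme group `R_AY_T` and differ only in the auxiliary unit, which is what lets the decoder keep a single output sequence per word while still distinguishing spellings.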