Paper Title
Revisiting Neural Language Modelling with Syllables
Paper Authors
Paper Abstract
Language modelling is regularly analysed at the word, subword or character level, but syllables are seldom used. Syllables provide shorter sequences than characters; they can be extracted with rules, and their segmentation typically requires less specialised effort than identifying morphemes. We reconsider syllables for an open-vocabulary generation task in 20 languages. We use rule-based syllabification methods for five languages and address the rest with a hyphenation tool, whose behaviour as a syllable proxy is validated. With comparable perplexity, we show that syllables outperform characters, annotated morphemes and unsupervised subwords. Finally, we also study the overlap between syllables and other subword pieces, and discuss some limitations and opportunities.
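The abstract does not name the hyphenation tool used as a syllable proxy. As a minimal sketch of the idea, the Python snippet below segments words with the pyphen library, which wraps the LibreOffice/Hunspell hyphenation dictionaries; pyphen, the language codes, and the syllabify helper are illustrative assumptions here, not necessarily the paper's actual pipeline.

# Sketch: hyphenation patterns as a proxy for syllable segmentation.
# Assumes the `pyphen` package (pip install pyphen); the paper's exact
# tool and language list are not specified in this abstract.
import pyphen

def syllabify(word: str, lang: str = "en_US") -> list[str]:
    """Split a word into syllable-like units via hyphenation patterns."""
    dic = pyphen.Pyphen(lang=lang)
    # `inserted` returns the word with hyphens at legal break points,
    # e.g. "syllable" -> "syl-la-ble"; splitting yields the units.
    return dic.inserted(word).split("-")

if __name__ == "__main__":
    for word in ["syllable", "language", "modelling"]:
        print(word, "->", syllabify(word))

These hyphenation points are typographic rather than strictly phonological, which is why the abstract speaks of validating the tool's behaviour as a syllable proxy instead of treating its output as gold-standard syllables.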