Paper Title

Capitalization Normalization for Language Modeling with an Accurate and Efficient Hierarchical RNN Model

Authors

Hao Zhang, You-Chi Cheng, Shankar Kumar, W. Ronny Huang, Mingqing Chen, Rajiv Mathews

Abstract

Capitalization normalization (truecasing) is the task of restoring the correct case (uppercase or lowercase) of noisy text. We propose a fast, accurate and compact two-level hierarchical word-and-character-based recurrent neural network model. We use the truecaser to normalize user-generated text in a Federated Learning framework for language modeling. A case-aware language model trained on this normalized text achieves the same perplexity as a model trained on text with gold capitalization. In a real user A/B experiment, we demonstrate that the improvement translates to reduced prediction error rates in a virtual keyboard application. Similarly, in an ASR language model fusion experiment, we show reduction in uppercase character error rate and word error rate.
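
As an illustration of the kind of architecture the abstract describes, below is a minimal sketch of a two-level hierarchical word-and-character RNN truecaser. It is not the authors' implementation: the framework (PyTorch), the HierarchicalTruecaser class, and all layer sizes are assumptions made for the example. A character BiLSTM summarizes each lowercased word, a word-level LSTM adds sentence context, and a per-character classifier predicts which characters to uppercase.

```python
# Minimal sketch (not the paper's implementation) of a two-level
# hierarchical word-and-character RNN truecaser. Hyperparameters and
# class/variable names are illustrative assumptions.
import torch
import torch.nn as nn


class HierarchicalTruecaser(nn.Module):
    def __init__(self, num_chars, char_dim=32, char_hidden=64, word_hidden=128):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim, padding_idx=0)
        # Level 1: character BiLSTM -> one vector per word.
        self.char_rnn = nn.LSTM(char_dim, char_hidden, batch_first=True,
                                bidirectional=True)
        # Level 2: word-level LSTM over the per-word vectors -> sentence context.
        self.word_rnn = nn.LSTM(2 * char_hidden, word_hidden, batch_first=True)
        # Per-character binary classifier: uppercase vs. lowercase.
        self.classifier = nn.Linear(word_hidden + char_dim, 1)

    def forward(self, char_ids):
        # char_ids: (batch, num_words, num_chars) of lowercased character ids.
        b, w, c = char_ids.shape
        chars = self.char_emb(char_ids)                 # (b, w, c, char_dim)
        flat = chars.view(b * w, c, -1)
        _, (h, _) = self.char_rnn(flat)                 # h: (2, b*w, char_hidden)
        word_vecs = torch.cat([h[0], h[1]], dim=-1).view(b, w, -1)
        word_ctx, _ = self.word_rnn(word_vecs)          # (b, w, word_hidden)
        # Broadcast each word's context back to its character positions.
        ctx = word_ctx.unsqueeze(2).expand(b, w, c, -1)
        logits = self.classifier(torch.cat([ctx, chars], dim=-1))
        return logits.squeeze(-1)                       # (b, w, c) case logits


if __name__ == "__main__":
    model = HierarchicalTruecaser(num_chars=100)
    dummy = torch.randint(1, 100, (2, 5, 8))   # 2 sentences, 5 words, 8 chars each
    print(model(dummy).shape)                  # torch.Size([2, 5, 8])
```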
