Paper Title
Examination and Extension of Strategies for Improving Personalized Language Modeling via Interpolation
Paper Authors
Paper Abstract
In this paper, we detail novel strategies for interpolating personalized language models, along with methods for handling out-of-vocabulary (OOV) tokens, to improve personalized language modeling. Using publicly available data from Reddit, we demonstrate improvements in offline metrics at the user level by interpolating a global LSTM-based authoring model with a user-personalized n-gram model. By optimizing this approach with a back-off to a uniform OOV penalty and tuning the interpolation coefficient, we observe that over 80% of users receive a lift in perplexity, with an average per-user perplexity lift of 5.2%. In doing this research, we extend previous work in building NLIs and improve the robustness of metrics for downstream tasks.
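The core idea described in the abstract can be sketched as a linear interpolation of two next-token probabilities: a global model's estimate mixed with a user-personalized n-gram model's estimate, where tokens outside the user model's vocabulary back off to a uniform OOV penalty. This is a minimal illustrative sketch, not the paper's implementation; the function names, the coefficient `lam`, and the penalty value are all assumptions.

```python
from typing import Optional

# Assumed uniform back-off probability for tokens unseen by the user model
# (the actual penalty value used in the paper is not specified here).
UNIFORM_OOV_PENALTY = 1e-6


def interpolate(p_global: float, p_user: Optional[float], lam: float) -> float:
    """Mix the global and user-model probabilities with coefficient lam.

    p_global: next-token probability from the global LSTM-based model.
    p_user:   next-token probability from the user n-gram model, or None
              when the token is out of the user model's vocabulary.
    lam:      interpolation coefficient in [0, 1]; higher values weight
              the personalized model more heavily.
    """
    if p_user is None:
        # Back off to a uniform OOV penalty for unseen tokens.
        p_user = UNIFORM_OOV_PENALTY
    return lam * p_user + (1.0 - lam) * p_global
```

In practice, `lam` would be tuned per user (or globally) to maximize held-out likelihood, which is how the reported perplexity lift over a global-only baseline would be measured.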