Paper Title

Offline Reinforcement Learning for Mobile Notifications

Paper Authors

Yiping Yuan, Ajith Muralidharan, Preetam Nandy, Miao Cheng, Prakruthi Prabhakar

Paper Abstract

Mobile notification systems have taken a major role in driving and maintaining user engagement for online platforms. To machine learning practitioners, they are an interesting class of recommender systems with more sequential and long-term feedback considerations. Most machine learning applications in notification systems are built around response-prediction models, trying to attribute both short-term and long-term impact to a notification decision. However, a user's experience depends on a sequence of notifications, and attributing impact to a single notification is not always accurate, if not impossible. In this paper, we argue that reinforcement learning is a better framework for notification systems in terms of performance and iteration speed. We propose an offline reinforcement learning framework to optimize sequential notification decisions for driving user engagement. We describe a state-marginalized importance sampling policy evaluation approach, which can be used to evaluate a policy offline and tune learning hyperparameters. Through simulations that approximate the notification ecosystem, we demonstrate the performance and benefits of the offline evaluation approach as part of the reinforcement learning modeling approach. Finally, we collect data through online exploration in the production system, train an offline Double Deep Q-Network, and launch a successful policy online. We also discuss the practical considerations and results obtained by deploying these policies for a large-scale recommendation system use case.
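To make the off-policy evaluation idea concrete, below is a minimal sketch of the standard per-trajectory importance sampling estimator that such approaches build on. The paper's state-marginalized variant differs: it replaces the product of per-step action-probability ratios with state-distribution ratios to reduce variance, and its exact form is not reproduced here. The `target_pi` interface and the trajectory tuple layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def is_policy_value(trajectories, target_pi, gamma=0.99):
    """Estimate a target policy's value from logged data via
    per-trajectory importance sampling (a standard OPE baseline).

    trajectories: list of trajectories, each a list of
                  (state, action, reward, behavior_prob) tuples, where
                  behavior_prob is the logged probability the behavior
                  policy assigned to the taken action.
    target_pi:    function (state, action) -> probability under the
                  policy being evaluated (hypothetical interface).
    """
    estimates = []
    for traj in trajectories:
        ratio, ret = 1.0, 0.0
        for t, (s, a, r, mu_prob) in enumerate(traj):
            # Cumulative importance weight re-weights the logged return
            # toward what the target policy would have generated.
            ratio *= target_pi(s, a) / mu_prob
            ret += (gamma ** t) * r
        estimates.append(ratio * ret)
    return float(np.mean(estimates))
```

The product of per-step ratios makes this estimator's variance grow with horizon length, which is precisely the weakness that state-marginalized approaches target.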
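The abstract also names an offline Double Deep Q-Network. Below is a minimal sketch of the Double DQN target computation, assuming a PyTorch setup; the network architectures, tensor names, and hyperparameters are assumptions for illustration, not details from the paper.

```python
import torch

def double_dqn_target(online_net, target_net, rewards, next_states, dones,
                      gamma=0.99):
    """Compute Double DQN regression targets.

    The online network selects the greedy next action; the target network
    evaluates it. Decoupling selection from evaluation is what reduces the
    Q-value overestimation of vanilla DQN.
    """
    with torch.no_grad():
        # Greedy action indices under the online network: shape (batch, 1).
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # Q-values for those actions under the target network: shape (batch,).
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        # Bootstrapped target; (1 - dones) zeroes out terminal transitions.
        return rewards + gamma * (1.0 - dones) * next_q
```

In an offline setting, this target would be computed over mini-batches drawn from the logged exploration data rather than from an online replay buffer.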
