Constrained Reinforcement Learning for Short Video Recommendation

Paper Title

Constrained Reinforcement Learning for Short Video Recommendation

Authors

Cai, Qingpeng; Zhan, Ruohan; Zhang, Chi; Zheng, Jie; Ding, Guangwei; Gong, Pinghua; Zheng, Dong; Jiang, Peng

Abstract

The wide popularity of short videos on social media poses new opportunities and challenges for optimizing recommender systems on video-sharing platforms. Users provide complex and multi-faceted responses to recommendations, including watch time and various types of interactions with videos. As a result, established recommendation algorithms that address a single objective are not adequate to meet this new demand of optimizing comprehensive user experiences. In this paper, we formulate the problem of short video recommendation as a constrained Markov Decision Process (MDP), where platforms want to optimize the main goal of user watch time in the long term, subject to the constraint of accommodating the auxiliary responses of user interactions such as sharing or downloading videos. To solve the constrained MDP, we propose a two-stage reinforcement learning approach based on the actor-critic framework. At stage one, we learn individual policies to optimize each auxiliary response. At stage two, we learn a policy to (i) optimize the main response and (ii) stay close to the policies learned at the first stage, which effectively guarantees the performance of this main policy on the auxiliaries. Through extensive simulations, we demonstrate the effectiveness of our approach over alternatives in both optimizing the main goal and balancing the others. We further show the advantage of our approach in live experiments on short video recommendation, where it significantly outperforms other baselines in terms of watch time and interactions from video views. Our approach has been fully launched in the production system to optimize user experiences on the platform.
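
The two-stage idea in the abstract can be illustrated with a small sketch. This is not the paper's actor-critic implementation: it assumes a one-step (bandit) simplification with three candidate videos, and all reward numbers, function names, the temperature, and the weight `lam` are hypothetical. Stage one is approximated by a softmax policy per auxiliary response; stage two uses the closed-form maximizer of "expected main reward minus `lam` times the sum of KL divergences to the stage-one policies", which holds only in this simplified setting.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def stage_one_policy(aux_reward, temperature=0.5):
    """Toy stage one: a policy that favors actions with high auxiliary reward.

    Assumption: a softmax over rewards stands in for a learned policy.
    """
    return softmax([r / temperature for r in aux_reward])


def stage_two_policy(main_reward, aux_policies, lam=1.0):
    """Toy stage two: maximize E_pi[main_reward] - lam * sum_i KL(pi || pi_i).

    In the one-step case the maximizer is proportional to
    exp(r(a) / (n * lam)) * prod_i pi_i(a) ** (1 / n), with n auxiliary policies.
    """
    n = len(aux_policies)
    logits = [
        r / (n * lam) + sum(math.log(p[a]) for p in aux_policies) / n
        for a, r in enumerate(main_reward)
    ]
    return softmax(logits)


def expected(policy, reward):
    """Expected reward of a stochastic policy over a finite action set."""
    return sum(p * r for p, r in zip(policy, reward))


# Hypothetical per-video rewards for three candidate videos.
watch_time = [1.0, 0.8, 0.2]   # main response
share = [0.0, 0.5, 0.6]        # auxiliary response
download = [0.1, 0.4, 0.5]     # auxiliary response

aux_policies = [stage_one_policy(share), stage_one_policy(download)]
constrained = stage_two_policy(watch_time, aux_policies, lam=0.5)

# Baseline for comparison: a near-greedy policy on watch time alone.
unconstrained = softmax([r / 0.05 for r in watch_time])

print("watch time:", expected(constrained, watch_time), expected(unconstrained, watch_time))
print("share:     ", expected(constrained, share), expected(unconstrained, share))
```

In this toy setting, a smaller `lam` lets the stage-two policy chase watch time more aggressively, while a larger `lam` pulls it toward the auxiliary policies; the constrained policy gives up some watch time in exchange for more expected shares than the watch-time-only baseline.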

