Paper Title
Branching Reinforcement Learning
Paper Authors
Paper Abstract
In this paper, we propose a novel Branching Reinforcement Learning (Branching RL) model and investigate both the Regret Minimization (RM) and Reward-Free Exploration (RFE) metrics for this model. Unlike standard RL, where the trajectory of each episode is a single $H$-step path, Branching RL allows an agent to take multiple base actions in a state such that transitions branch out to multiple successor states correspondingly, thus generating a tree-structured trajectory. This model finds important applications in hierarchical recommendation systems and online advertising. For Branching RL, we establish new Bellman equations and key lemmas, namely a branching value difference lemma and a branching law of total variance, and also bound the total variance by only $O(H^2)$ despite the exponentially-large trajectory. For the RM and RFE metrics, we propose computationally efficient algorithms BranchVI and BranchRFE, respectively, and derive nearly matching upper and lower bounds. Our results are only polynomial in the problem parameters despite the exponentially-large trajectories.
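
To make the tree-structured trajectory concrete, below is a minimal Python sketch of how a single Branching RL episode could unfold, based only on the model description in the abstract: at every state the agent takes several base actions, each action transitions to its own successor state, and the episode grows as a depth-$H$ tree. All names here (`rollout_branching_episode`, `policy`, `P`, `R`) are hypothetical illustrations for exposition; they are not the paper's algorithms BranchVI or BranchRFE.

```python
import numpy as np

def rollout_branching_episode(P, R, policy, s0, H, seed=0):
    """Simulate one tree-structured episode of horizon H.

    P[s][a]      -- probability vector over successor states for base action a in state s
    R[s][a]      -- immediate reward for base action a in state s
    policy(s, h) -- the set of base actions the agent takes at state s, step h
    Returns the episode as a nested dict representing the trajectory tree.
    """
    rng = np.random.default_rng(seed)

    def expand(s, h):
        node = {"state": s, "children": []}
        if h == H:
            return node  # leaf: the horizon is exhausted
        for a in policy(s, h):                             # multiple base actions per state
            s_next = rng.choice(len(P[s][a]), p=P[s][a])   # each action branches to its own successor
            node["children"].append(
                {"action": a, "reward": R[s][a], "subtree": expand(s_next, h + 1)}
            )
        return node

    return expand(s0, 0)


# Tiny hypothetical instance: 2 states, 2 base actions, horizon H = 3.
P = [[[0.5, 0.5], [0.2, 0.8]],
     [[1.0, 0.0], [0.3, 0.7]]]
R = [[1.0, 0.0], [0.5, 2.0]]
take_both = lambda s, h: [0, 1]   # the agent takes both base actions in every state
episode_tree = rollout_branching_episode(P, R, take_both, s0=0, H=3)
```

In this toy run, taking both base actions at every state yields a trajectory tree with $2^H$ root-to-leaf paths, which illustrates the exponentially-large trajectories the abstract refers to; the point of the paper's results is that the regret and sample-complexity bounds nevertheless remain polynomial in the problem parameters.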