Paper Title


Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret

Paper Authors

Jiawei Huang, Li Zhao, Tao Qin, Wei Chen, Nan Jiang, Tie-Yan Liu

Abstract


We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where the users can be divided into two groups based on their different tolerance for exploration risks and should be treated separately. In this setting, we simultaneously maintain two policies $π^{\text{O}}$ and $π^{\text{E}}$: $π^{\text{O}}$ ("O" for "online") interacts with more risk-tolerant users from the first tier and minimizes regret by balancing exploration and exploitation as usual, while $π^{\text{E}}$ ("E" for "exploit") exclusively focuses on exploitation for risk-averse users from the second tier, utilizing the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., $π^{\text{E}}=π^{\text{O}}$) for the risk-averse users. We consider the gap-independent and gap-dependent settings separately. For the former, we prove that the separation is indeed not beneficial from a minimax perspective. For the latter, we show that if Pessimistic Value Iteration is chosen as the exploitation algorithm to produce $π^{\text{E}}$, we can achieve constant regret for risk-averse users, independent of the number of episodes $K$, which is in sharp contrast to the $Ω(\log K)$ regret of any online RL algorithm in the same setting; meanwhile, the regret of $π^{\text{O}}$ (almost) maintains its online optimality and does not need to be compromised for the success of $π^{\text{E}}$.
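To make the tiered protocol concrete, below is a minimal, illustrative sketch (not the paper's exact algorithm or analysis) on a toy tabular finite-horizon MDP: an optimistic value-iteration policy stands in for $π^{\text{O}}$ and gathers data, while pessimistic value iteration over the same empirical model produces $π^{\text{E}}$ for the risk-averse tier. The random MDP, the Hoeffding-style bonus, and all constants are assumptions made purely for illustration.

```python
import numpy as np

np.random.seed(0)

H, S, A = 3, 4, 2          # horizon, number of states, number of actions
K = 200                    # number of episodes

# A fixed random MDP used only for illustration (not from the paper).
P = np.random.dirichlet(np.ones(S), size=(H, S, A))   # P[h, s, a] = next-state distribution
R = np.random.rand(H, S, A)                           # deterministic rewards in [0, 1]

# Empirical statistics gathered from the online tier's interactions.
N = np.zeros((H, S, A))
N_next = np.zeros((H, S, A, S))
R_sum = np.zeros((H, S, A))

def value_iteration(bonus_sign):
    """Backward value iteration on the empirical model.

    bonus_sign = +1 gives an optimistic (exploring) policy, used here as pi^O;
    bonus_sign = -1 gives pessimistic value iteration, used here as pi^E.
    """
    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        n = np.maximum(N[h], 1)
        P_hat = N_next[h] / n[..., None]                 # empirical transitions
        R_hat = R_sum[h] / n                             # empirical rewards
        bonus = np.sqrt(np.log(2 * S * A * H * K) / n)   # crude Hoeffding-style bonus (assumption)
        Q = R_hat + P_hat @ V[h + 1] + bonus_sign * bonus
        Q = np.clip(Q, 0.0, H - h)                       # value range for steps h..H-1
        pi[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return pi

for k in range(K):
    pi_online = value_iteration(bonus_sign=+1)    # pi^O: explores, interacts with tier 1
    pi_exploit = value_iteration(bonus_sign=-1)   # pi^E: pessimistic, deployed to tier 2

    # Only pi^O collects new data; pi^E is served to risk-averse users without exploration.
    s = 0
    for h in range(H):
        a = pi_online[h, s]
        s_next = np.random.choice(S, p=P[h, s, a])
        N[h, s, a] += 1
        N_next[h, s, a, s_next] += 1
        R_sum[h, s, a] += R[h, s, a]
        s = s_next
```

The design point mirrored in this sketch is that $π^{\text{E}}$ never explores: it only re-plans pessimistically from the data that $π^{\text{O}}$ has already collected, which is the separation the paper exploits to obtain constant (in $K$) regret for the risk-averse tier in the gap-dependent setting.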
