Paper Title
Instance-optimal PAC Algorithms for Contextual Bandits
Paper Authors
Paper Abstract
In the stochastic contextual bandit setting, regret-minimizing algorithms have been extensively researched, but their instance-minimizing best-arm identification counterparts remain seldom studied. In this work, we focus on the stochastic bandit problem in the $(ε,δ)$-$\textit{PAC}$ setting: given a policy class $Π$, the goal of the learner is to return a policy $π\in Π$ whose expected reward is within $ε$ of that of the optimal policy, with probability greater than $1-δ$. We characterize the first $\textit{instance-dependent}$ PAC sample complexity of contextual bandits through a quantity $ρ_Π$, and provide matching upper and lower bounds in terms of $ρ_Π$ for the agnostic and linear contextual best-arm identification settings. We show that no algorithm can be simultaneously minimax-optimal for regret minimization and instance-dependent PAC for best-arm identification. Our main result is a new instance-optimal and computationally efficient algorithm that relies on a polynomial number of calls to an argmax oracle.
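To make the $(ε,δ)$-PAC criterion concrete, the sketch below shows a naive uniform-sampling baseline over a finite policy class: estimate every policy's value to accuracy $ε/2$ via a Hoeffding bound and return the empirical argmax. This is purely illustrative of the problem definition, not the paper's instance-optimal, oracle-efficient algorithm; the function names, toy contexts, arms, and reward means are all assumptions made for the example.

```python
import math
import random

def pac_best_policy(policies, sample_reward, contexts, eps, delta, rng):
    """Naive (eps, delta)-PAC selection: uniform sampling over a finite
    policy class. Illustrates the criterion, not the paper's algorithm."""
    # Hoeffding bound for [0, 1] rewards: n samples per policy make each
    # value estimate eps/2-accurate, so by a union bound over |policies|
    # the empirical argmax is eps-optimal with probability >= 1 - delta.
    n = math.ceil((2 / eps**2) * math.log(2 * len(policies) / delta))
    best, best_val = None, -math.inf
    for pi in policies:
        total = 0.0
        for _ in range(n):
            x = rng.choice(contexts)            # context drawn uniformly
            total += sample_reward(x, pi(x), rng)
        if total / n > best_val:
            best, best_val = pi, total / n
    return best

# Toy instance (illustrative): two contexts, two arms, arm 1 clearly better.
means = {0: 0.1, 1: 0.9}                        # mean Bernoulli reward per arm
reward = lambda x, a, rng: float(rng.random() < means[a])
policies = [lambda x: 0, lambda x: 1]           # two constant policies
rng = random.Random(0)
pi_hat = pac_best_policy(policies, reward, [0, 1], eps=0.2, delta=0.1, rng=rng)
```

Note the contrast with the paper's contribution: this baseline spends $O\!\left(|Π| \cdot ε^{-2}\log(|Π|/δ)\right)$ samples on every instance alike, whereas the instance-dependent complexity $ρ_Π$ adapts to the gaps of the particular problem and the algorithm remains efficient via argmax-oracle calls rather than enumeration of $Π$.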