Paper Title
Instance-optimal PAC Algorithms for Contextual Bandits
Paper Authors
Paper Abstract
In the stochastic contextual bandit setting, regret-minimizing algorithms have been extensively researched, but their instance-minimizing best-arm identification counterparts remain seldom studied. In this work, we focus on the stochastic bandit problem in the $(ε,δ)$-$\textit{PAC}$ setting: given a policy class $Π$, the goal of the learner is to return a policy $π\in Π$ whose expected reward is within $ε$ of that of the optimal policy, with probability greater than $1-δ$. We characterize the first $\textit{instance-dependent}$ PAC sample complexity of contextual bandits through a quantity $ρ_Π$, and provide matching upper and lower bounds in terms of $ρ_Π$ for the agnostic and linear contextual best-arm identification settings. We show that no algorithm can be simultaneously minimax-optimal for regret minimization and instance-dependent PAC for best-arm identification. Our main result is a new instance-optimal and computationally efficient algorithm that relies on a polynomial number of calls to an argmax oracle.
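To make the $(ε,δ)$-PAC criterion concrete, the sketch below shows a naive uniform-sampling baseline over a finite policy class: estimate every policy's value to accuracy $ε/2$ via a Hoeffding bound and return the empirical argmax. This is purely illustrative of the problem definition, not the paper's instance-optimal, oracle-efficient algorithm; the function names, toy contexts, arms, and reward means are all assumptions made for the example.

```python
import math
import random

def pac_best_policy(policies, sample_reward, contexts, eps, delta, rng):
    """Naive (eps, delta)-PAC selection: uniform sampling over a finite
    policy class. Illustrates the criterion, not the paper's algorithm."""
    # Hoeffding bound for [0, 1] rewards: n samples per policy make each
    # value estimate eps/2-accurate, so by a union bound over |policies|
    # the empirical argmax is eps-optimal with probability >= 1 - delta.
    n = math.ceil((2 / eps**2) * math.log(2 * len(policies) / delta))
    best, best_val = None, -math.inf
    for pi in policies:
        total = 0.0
        for _ in range(n):
            x = rng.choice(contexts)            # context drawn uniformly
            total += sample_reward(x, pi(x), rng)
        if total / n > best_val:
            best, best_val = pi, total / n
    return best

# Toy instance (illustrative): two contexts, two arms, arm 1 clearly better.
means = {0: 0.1, 1: 0.9}                        # mean Bernoulli reward per arm
reward = lambda x, a, rng: float(rng.random() < means[a])
policies = [lambda x: 0, lambda x: 1]           # two constant policies
rng = random.Random(0)
pi_hat = pac_best_policy(policies, reward, [0, 1], eps=0.2, delta=0.1, rng=rng)
```

Note the contrast with the paper's contribution: this baseline spends $O\!\left(|Π| \cdot ε^{-2}\log(|Π|/δ)\right)$ samples on every instance alike, whereas the instance-dependent complexity $ρ_Π$ adapts to the gaps of the particular problem and the algorithm remains efficient via argmax-oracle calls rather than enumeration of $Π$.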