Paper Title
Asymptotic Instance-Optimal Algorithms for Interactive Decision Making
Paper Authors
Paper Abstract
Past research on interactive decision making problems (bandits, reinforcement learning, etc.) mostly focuses on the minimax regret, which measures an algorithm's performance on the hardest instance. However, an ideal algorithm should adapt to the complexity of a particular problem instance and incur smaller regret on easy instances than on worst-case ones. In this paper, we design the first asymptotic instance-optimal algorithm for general interactive decision making problems with a finite number of decisions under mild conditions. On every instance $f$, our algorithm outperforms all consistent algorithms (those achieving non-trivial regret on all instances), and has asymptotic regret $\mathcal{C}(f) \ln n$, where $\mathcal{C}(f)$ is an exact characterization of the complexity of $f$. The key step of the algorithm involves hypothesis testing with active data collection. It computes the most economical decisions with which the algorithm collects observations to test whether an estimated instance is indeed correct; thus, the complexity $\mathcal{C}(f)$ is the minimum cost of testing the instance $f$ against other instances. Our results, instantiated on concrete problems, recover the classical gap-dependent bounds for multi-armed bandits [Lai and Robbins, 1985] and prior works on linear bandits [Lattimore and Szepesvari, 2017], and improve upon the previous best instance-dependent upper bound [Xu et al., 2021] for reinforcement learning.
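For concreteness, a sketch of how such a complexity measure specializes in the multi-armed bandit case (the notation here is illustrative rather than taken from the abstract: $\Delta_i$ denotes the suboptimality gap of arm $i$, $f_i$ its reward distribution, and $i^\star$ the optimal arm). The classical Lai-Robbins characterization for consistent algorithms takes the form
$$\mathcal{C}(f) \;=\; \sum_{i:\, \Delta_i > 0} \frac{\Delta_i}{\mathrm{KL}\!\left(f_i \,\|\, f_{i^\star}\right)},$$
so an asymptotic regret of $\mathcal{C}(f) \ln n$ matches the gap-dependent lower bound of [Lai and Robbins, 1985] that the abstract refers to.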