Paper Title
Learning Neural Contextual Bandits Through Perturbed Rewards
Paper Authors
Paper Abstract
Thanks to the power of representation learning, neural contextual bandit algorithms demonstrate remarkable performance improvements over their classical counterparts. However, because their exploration must be performed over the entire neural network parameter space to obtain nearly optimal regret, the resulting computational cost is prohibitively high. We perturb the rewards when updating the neural network, which eliminates the need for explicit exploration and the corresponding computational overhead. We prove that a $\tilde{O}(\tilde{d}\sqrt{T})$ regret upper bound is still achievable under standard regularity conditions, where $T$ is the number of rounds of interaction and $\tilde{d}$ is the effective dimension of a neural tangent kernel matrix. Extensive comparisons with several benchmark contextual bandit algorithms, including two recent neural contextual bandit models, demonstrate the effectiveness and computational efficiency of our proposed neural bandit algorithm.
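The abstract's core mechanism, fitting the reward network on randomly perturbed copies of the observed rewards and then acting greedily, can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the authors' reference implementation: the network architecture, noise scale `nu`, ridge weight `lam`, and optimizer settings are placeholder choices, and the ridge term toward the random initialization mirrors the regularization typically used in NTK-style analyses of neural bandits.

```python
# Minimal sketch of a perturbed-reward neural bandit update (illustrative only).
# Exploration comes from re-perturbing the reward history at every update,
# so arm selection is purely greedy and no parameter-space confidence set is built.
import torch
import torch.nn as nn


class RewardNet(nn.Module):
    """Small MLP estimating the expected reward of a context-arm feature vector."""

    def __init__(self, dim: int, width: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, width), nn.ReLU(), nn.Linear(width, 1)
        )
        # Snapshot of the random initialization, used for ridge regularization below.
        self.init_params = [p.detach().clone() for p in self.parameters()]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


def select_arm(model: RewardNet, arm_features: torch.Tensor) -> int:
    """Greedy arm selection; no UCB-style exploration bonus is needed."""
    with torch.no_grad():
        return int(model(arm_features).argmax().item())


def update(model: RewardNet, contexts: torch.Tensor, rewards: torch.Tensor,
           nu: float = 0.1, lam: float = 1e-3, epochs: int = 50) -> None:
    """Refit the network on the full history with freshly perturbed rewards."""
    # Fresh i.i.d. Gaussian perturbation of every historical reward at each update;
    # this randomization plays the role of exploration.
    perturbed = rewards + nu * torch.randn_like(rewards)
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((model(contexts) - perturbed) ** 2).mean()
        # Ridge regularization toward the initial parameters (illustrative choice).
        reg = sum(((p - p0) ** 2).sum()
                  for p, p0 in zip(model.parameters(), model.init_params))
        (loss + lam * reg).backward()
        opt.step()
```

In use, each round would call `select_arm` on the current arm feature matrix, append the chosen feature and observed reward to the history, and call `update` on that history; the noise scale `nu` governs how aggressively the greedy policy deviates from the current fit.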