Paper Title
Reward-Biased Maximum Likelihood Estimation for Neural Contextual Bandits
Paper Authors
Paper Abstract
Reward-biased maximum likelihood estimation (RBMLE) is a classic principle in the adaptive control literature for tackling explore-exploit trade-offs. This paper studies the stochastic contextual bandit problem with general bounded reward functions and proposes NeuralRBMLE, which adapts the RBMLE principle by adding a bias term to the log-likelihood to enforce exploration. NeuralRBMLE leverages the representation power of neural networks and directly encodes exploratory behavior in the parameter space, without constructing confidence intervals of the estimated rewards. We propose two variants of NeuralRBMLE algorithms: The first variant directly obtains the RBMLE estimator by gradient ascent, and the second variant simplifies RBMLE to a simple index policy through an approximation. We show that both algorithms achieve $\widetilde{\mathcal{O}}(\sqrt{T})$ regret. Through extensive experiments, we demonstrate that the NeuralRBMLE algorithms achieve comparable or better empirical regrets than the state-of-the-art methods on real-world datasets with non-linear reward functions.
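The following is a minimal, illustrative sketch of the reward-biased objective described in the abstract: the log-likelihood of past rewards plus a bias term that favors parameters predicting high rewards, optimized by gradient ascent. The network architecture, the Gaussian likelihood, the bias weight alpha_t, and the names RewardNet and rbmle_objective are all illustrative assumptions, not the paper's exact construction.

```python
# Illustrative sketch only: a reward-biased objective for a neural contextual bandit,
# optimized by gradient ascent. Assumes a Gaussian reward likelihood and a fixed bias
# weight alpha_t; the paper's actual bias schedule and architecture may differ.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Small MLP mapping a context-action feature vector to a scalar reward estimate."""
    def __init__(self, dim: int, width: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, width), nn.ReLU(), nn.Linear(width, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def rbmle_objective(model, past_x, past_r, cand_x, alpha_t):
    """Log-likelihood of past rewards (Gaussian assumption, up to constants)
    plus a bias term alpha_t * max predicted reward over the current candidate arms."""
    pred = model(past_x)
    log_lik = -0.5 * ((past_r - pred) ** 2).sum()  # Gaussian log-likelihood term
    bias = alpha_t * model(cand_x).max()           # reward bias encourages optimism
    return log_lik + bias

# Toy usage: gradient ascent on the biased objective, then act greedily.
torch.manual_seed(0)
dim, n_hist, n_actions = 8, 32, 4
past_x, past_r = torch.randn(n_hist, dim), torch.randn(n_hist)  # past feature/reward pairs
cand_x = torch.randn(n_actions, dim)                            # features of the current arms
model = RewardNet(dim)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    loss = -rbmle_objective(model, past_x, past_r, cand_x, alpha_t=1.0)
    loss.backward()
    opt.step()
chosen_arm = int(model(cand_x).argmax())  # pick the arm with the highest biased estimate
```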