Paper Title

Mixed-Effect Thompson Sampling

Paper Authors

Imad Aouali, Branislav Kveton, Sumeet Katariya

Paper Abstract


A contextual bandit is a popular framework for online learning to act under uncertainty. In practice, the number of actions is huge and their expected rewards are correlated. In this work, we introduce a general framework for capturing such correlations through a mixed-effect model where actions are related through multiple shared effect parameters. To explore efficiently using this structure, we propose Mixed-Effect Thompson Sampling (meTS) and bound its Bayes regret. The regret bound has two terms, one for learning the action parameters and the other for learning the shared effect parameters. The terms reflect the structure of our model and the quality of priors. Our theoretical findings are validated empirically using both synthetic and real-world problems. We also propose numerous extensions of practical interest. While they do not come with guarantees, they perform well empirically and show the generality of the proposed framework.
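The abstract describes a bandit model in which the mean rewards of many actions are tied together by shared effect parameters, and a Thompson sampling algorithm that exploits this structure. Below is a minimal, illustrative sketch of that idea as a linear-Gaussian bandit: each action's mean reward is a linear mix of shared effects plus an action-specific offset, and exploration proceeds by sampling from the joint Gaussian posterior. The specific instantiation (linear mixing matrix B, known noise sigma, conjugate Gaussian updates) and all variable names are assumptions made for this sketch, not the paper's exact meTS algorithm or regret analysis.

```python
# Illustrative sketch (assumed linear-Gaussian model, not the paper's exact method):
# mean reward of action a = B[a] @ (shared effects) + (action-specific offset) + noise.
# Thompson sampling explores by sampling the joint parameter vector from its posterior.
import numpy as np

rng = np.random.default_rng(0)

K, L = 20, 3          # number of actions, number of shared effect parameters (assumed)
sigma = 0.5           # known reward-noise standard deviation (assumed)
B = rng.normal(size=(K, L))   # how each action mixes the shared effects (assumed known)

# Joint parameter vector w = [shared effects (L); per-action offsets (K)].
# Feature of action a: phi_a = [B[a]; one_hot(a)], so E[reward | a] = phi_a @ w.
d = L + K
Phi = np.hstack([B, np.eye(K)])

# Gaussian prior on w: broad on shared effects, tighter on the per-action offsets.
prior_prec = np.diag(np.r_[np.full(L, 1.0), np.full(K, 4.0)])
prior_mean = np.zeros(d)

# Ground-truth parameters, used only to simulate rewards in this example.
w_true = rng.multivariate_normal(prior_mean, np.linalg.inv(prior_prec))

# Posterior statistics for Bayesian linear regression with known noise.
Lambda = prior_prec.copy()          # posterior precision
b = prior_prec @ prior_mean         # precision-weighted mean accumulator

for t in range(1000):
    # Thompson sampling: draw one parameter vector from the current posterior.
    cov = np.linalg.inv(Lambda)
    w_sample = rng.multivariate_normal(cov @ b, cov)
    a = int(np.argmax(Phi @ w_sample))            # act greedily w.r.t. the sample

    r = Phi[a] @ w_true + sigma * rng.normal()    # observe a noisy reward
    Lambda += np.outer(Phi[a], Phi[a]) / sigma**2 # conjugate Gaussian update
    b += r * Phi[a] / sigma**2

post_mean = np.linalg.solve(Lambda, b)
print("true best action:", int(np.argmax(Phi @ w_true)),
      "| estimated best action:", int(np.argmax(Phi @ post_mean)))
```

Because the shared effects appear in every action's feature vector, a pull of one action sharpens the posterior for all correlated actions, which is the intuition behind the two-term regret bound described in the abstract.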
