Title
PAC-Bayesian Offline Contextual Bandits With Guarantees
Authors
Abstract
This paper introduces a new principled approach for off-policy learning in contextual bandits. Unlike previous work, our approach does not derive learning principles from intractable or loose bounds. We analyse the problem through the PAC-Bayesian lens, interpreting policies as mixtures of decision rules. This allows us to propose novel generalization bounds and provide tractable algorithms to optimize them. We prove that the derived bounds are tighter than their competitors, and can be optimized directly to confidently improve upon the logging policy offline. Our approach learns policies with guarantees, uses all available data and does not require tuning additional hyperparameters on held-out sets. We demonstrate through extensive experiments the effectiveness of our approach in providing performance guarantees in practical scenarios.
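To make the off-policy setting concrete, the sketch below shows the standard baseline the abstract is implicitly comparing against: a clipped inverse-propensity-scoring (IPS) estimate of a target policy's value from logged bandit data, with a Hoeffding-style high-probability lower bound. This is a generic illustration of "confidently improving upon the logging policy offline", not the paper's PAC-Bayesian bound; the simulated data, uniform logging policy, and `ips_lower_bound` helper are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_actions = 10_000, 5

# Simulated logged bandit data (contexts omitted for brevity):
# the logging policy is uniform, so every logged propensity is 1/num_actions.
actions = rng.integers(num_actions, size=n)
propensities = np.full(n, 1.0 / num_actions)
rewards = (actions == 0).astype(float)  # in this toy problem, action 0 is always best

def ips_lower_bound(target_probs, actions, propensities, rewards,
                    delta=0.05, clip=10.0):
    """Clipped IPS value estimate for a target policy, plus a Hoeffding-style
    lower bound holding with probability 1 - delta (a standard baseline,
    not the PAC-Bayesian bound derived in the paper)."""
    w = np.minimum(target_probs[actions] / propensities, clip)
    est = np.mean(w * rewards)
    # Hoeffding's inequality: each term w * r lies in [0, clip].
    slack = clip * np.sqrt(np.log(1.0 / delta) / (2 * len(rewards)))
    return est, est - slack

# Hypothetical target policy putting most of its mass on the best action.
target = np.full(num_actions, 0.05 / (num_actions - 1))
target[0] = 0.95

est, lb = ips_lower_bound(target, actions, propensities, rewards)
```

A learner could deploy `target` only if `lb` exceeds the logging policy's estimated value; the abstract's claim is that PAC-Bayesian bounds of this kind can be made tighter and optimized directly over the policy itself.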