Paper Title
Combining Offline Causal Inference and Online Bandit Learning for Data Driven Decision
Paper Authors
Paper Abstract
A fundamental question for companies with large amounts of logged data is: how can such logged data be used together with incoming streaming data to make good decisions? Many companies currently make decisions via online A/B tests, but wrong decisions during testing hurt users' experiences and cause irreversible damage. A typical alternative is offline causal inference, which analyzes logged data alone to make decisions. However, these decisions are not adaptive to the new incoming data, so a wrong decision will continuously hurt users' experiences. To overcome the aforementioned limitations, we propose a framework to unify offline causal inference algorithms (e.g., weighting, matching) and online learning algorithms (e.g., UCB, LinUCB). We propose novel algorithms and derive bounds on the decision accuracy via the notion of "regret". We derive the first upper regret bound for forest-based online bandit algorithms. Experiments on two real datasets show that our algorithms outperform algorithms that use only logged data or online feedback, as well as algorithms that do not use the data properly.
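For background, the abstract's online-learning baseline (UCB) and its "regret" metric can be illustrated with a minimal sketch. This is a standard UCB1 simulation on Bernoulli arms, not the paper's combined offline-plus-online algorithm; the arm means and horizon are hypothetical.

```python
import math
import random

def ucb1(true_means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms and return the cumulative pseudo-regret."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # cumulative reward per arm
    best = max(true_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1   # initialization: pull each arm once
        else:
            # UCB index = empirical mean + exploration bonus
            arm = max(range(k), key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        # regret counts the expected loss of not pulling the best arm
        regret += best - true_means[arm]
    return regret
```

Because the exploration bonus shrinks as an arm is pulled more often, suboptimal arms are pulled only logarithmically often, so cumulative regret grows sublinearly in the horizon; the paper's framework additionally warm-starts such online algorithms with estimates from the logged data.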