Paper Title
Incrementality Bidding via Reinforcement Learning under Mixed and Delayed Rewards
Paper Authors
Paper Abstract
Incrementality, which is used to measure the causal effect of showing an ad to a potential customer (e.g., a user on an internet platform) versus not showing it, is a central object for advertisers in online advertising platforms. This paper investigates how an advertiser can learn to optimize its bidding sequence in an online manner \emph{without} knowing the incrementality parameters in advance. We formulate the offline version of this problem as a specially structured episodic Markov Decision Process (MDP) and then, for its online learning counterpart, propose a novel reinforcement learning (RL) algorithm with regret at most $\widetilde{O}(H^2\sqrt{T})$, which depends on the number of rounds $H$ and the number of episodes $T$, but does not depend on the number of actions (i.e., possible bids). A fundamental difference between our learning problem and standard RL problems is that the realized reward feedback from conversion incrementality is \emph{mixed} and \emph{delayed}. To handle this difficulty, we propose and analyze a novel pairwise moment-matching algorithm to learn the conversion incrementality, which we believe is of independent interest.
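The abstract does not spell out the pairwise moment-matching algorithm, so the following is only a loose conceptual sketch of the mixed-feedback difficulty it addresses: when only an aggregate end-of-episode conversion count is observed, per-round incrementalities can still be recovered by matching first moments across pairs of bidding policies that differ in a single round. The round count, the per-round conversion probabilities, and the differencing scheme below are all illustrative assumptions, not the paper's method.

```python
import random

random.seed(0)

H = 3                            # rounds per episode (assumed)
true_p = [0.30, 0.15, 0.05]      # per-round conversion incrementality (assumed values)

def run_episode(shown):
    """Mixed, delayed feedback: only the end-of-episode total count is observed,
    not which round each conversion came from."""
    return sum(1 for h in range(H) if shown[h] and random.random() < true_p[h])

def mean_total(shown, n=20000):
    """Empirical first moment of the aggregate conversion count under a policy."""
    return sum(run_episode(shown) for _ in range(n)) / n

# Pairwise first-moment matching: for policies differing only in round h,
# E[S with round h] - E[S without round h] = p_h.
est = []
for h in range(H):
    with_h = [True] * H
    without_h = [True] * H
    without_h[h] = False
    est.append(mean_total(with_h) - mean_total(without_h))

print([round(p, 2) for p in est])
```

This toy version only uses first moments and ignores delay distributions; the paper's algorithm additionally contends with conversions arriving after a delay, which further mixes rewards across rounds.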