Paper Title

Contextual Bandits in a Survey Experiment on Charitable Giving: Within-Experiment Outcomes versus Policy Learning

Paper Authors

Susan Athey, Undral Byambadalai, Vitor Hadad, Sanath Kumar Krishnamurthy, Weiwen Leung, Joseph Jay Williams

Paper Abstract

We design and implement an adaptive experiment (a "contextual bandit") to learn a targeted treatment assignment policy, where the goal is to use a participant's survey responses to determine which charity to expose them to in a donation solicitation. The design balances two competing objectives: optimizing the outcomes for the subjects in the experiment ("cumulative regret minimization") and gathering data that will be most useful for policy learning, that is, for learning an assignment rule that will maximize welfare if used after the experiment ("simple regret minimization"). We evaluate alternative experimental designs by collecting pilot data and then conducting a simulation study. Next, we implement our selected algorithm. Finally, we perform a second simulation study, anchored to the collected data, that evaluates the benefits of the algorithm we chose. Our first result is that the value of a learned policy in this setting is higher when data is collected via uniform randomization rather than adaptively using standard cumulative regret minimization or policy learning algorithms. We propose a simple heuristic for adaptive experimentation that improves upon uniform randomization from the perspective of policy learning at the expense of increasing cumulative regret relative to alternative bandit algorithms. The heuristic modifies an existing contextual bandit algorithm by (i) imposing a lower bound on assignment probabilities that decays slowly, so that no arm is discarded too quickly, and (ii) after adaptively collecting data, restricting policy learning to select from arms for which sufficient data has been gathered.
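The two modifications described in the heuristic can be illustrated with a short sketch. The Python snippet below is only a minimal illustration under stated assumptions, not the paper's implementation: the floor schedule c / t**alpha, the constants c and alpha, and the min_count threshold are chosen here for concreteness and do not come from the paper.

```python
import numpy as np

def floored_assignment_probs(base_probs, t, c=0.05, alpha=0.5):
    """Modification (i): impose a slowly decaying lower bound on the
    assignment probabilities proposed by the underlying contextual bandit,
    so that no arm's probability shrinks to zero too quickly.

    The floor schedule c / t**alpha and its constants are illustrative
    assumptions, not values taken from the paper.
    """
    base = np.asarray(base_probs, dtype=float)
    k = len(base)
    floor = min(c / t ** alpha, 1.0 / k)  # decaying floor, capped to stay feasible
    # Mix toward the uniform distribution so every arm retains probability
    # at least `floor`, while the probabilities still sum to one.
    return (1.0 - k * floor) * base + floor

def eligible_arms(arm_counts, min_count=50):
    """Modification (ii): after the adaptive data collection, restrict the
    policy-learning step to arms with enough observations.

    The threshold min_count is an illustrative assumption.
    """
    return [a for a, n in enumerate(arm_counts) if n >= min_count]

# Example: the bandit proposes (0.9, 0.08, 0.02) at step t=100; the floor
# keeps every arm's assignment probability at or above c / sqrt(t) = 0.005.
probs = floored_assignment_probs([0.9, 0.08, 0.02], t=100)
```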
