Paper Title

The Boltzmann Policy Distribution: Accounting for Systematic Suboptimality in Human Models

Authors

Cassidy Laidlaw, Anca Dragan

Abstract

Models of human behavior for prediction and collaboration tend to fall into two categories: ones that learn from large amounts of data via imitation learning, and ones that assume human behavior to be noisily optimal for some reward function. The former are very useful, but only when it is possible to gather a lot of human data in the target environment and distribution. The advantage of the latter type, which includes Boltzmann rationality, is the ability to make accurate predictions in new environments without extensive data when humans are actually close to optimal. However, these models fail when humans exhibit systematic suboptimality, i.e., when their deviations from optimal behavior are not independent, but instead consistent over time. Our key insight is that systematic suboptimality can be modeled by predicting policies, which couple action choices over time, instead of trajectories. We introduce the Boltzmann policy distribution (BPD), which serves as a prior over human policies and adapts via Bayesian inference to capture systematic deviations by observing human actions during a single episode. The BPD is difficult to compute and represent because policies lie in a high-dimensional continuous space, but we leverage tools from generative and sequence models to enable efficient sampling and inference. We show that the BPD enables prediction of human behavior and human-AI collaboration as well as imitation-learning-based human models do, while using far less data.
