Paper title
BRExIt: On Opponent Modelling in Expert Iteration
Authors
Abstract
Finding a best response policy is a central objective in game theory and multi-agent learning, with modern population-based training approaches employing reinforcement learning algorithms as best-response oracles to improve play against candidate opponents (typically previously learnt policies). We propose Best Response Expert Iteration (BRExIt), which accelerates learning in games by incorporating opponent models into the state-of-the-art learning algorithm Expert Iteration (ExIt). BRExIt aims to (1) improve feature shaping in the apprentice, with a policy head predicting opponent policies as an auxiliary task, and (2) bias opponent moves in planning towards the given or learnt opponent model, to generate apprentice targets that better approximate a best response. In an empirical ablation on BRExIt's algorithmic variants against a set of fixed test agents, we provide statistical evidence that BRExIt learns better performing policies than ExIt.
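The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the apprentice is trained with a cross-entropy policy loss and a squared-error value loss, that the auxiliary task adds a cross-entropy term on predicted opponent policies, and that planning is a tree search with a PUCT-style selection rule in which opponent nodes take their action prior from the opponent model. All function names, the `aux_weight` parameter, and the `c_puct` constant are hypothetical.

```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy H(target, predicted) between two discrete distributions."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def apprentice_loss(policy_target, policy_pred, value_target, value_pred,
                    opponent_target, opponent_pred, aux_weight=1.0):
    """Policy and value losses plus an auxiliary opponent-policy term.

    aux_weight is a hypothetical knob: with aux_weight=0 the auxiliary
    opponent-modelling term vanishes and only the first two terms remain.
    """
    policy_loss = cross_entropy(policy_target, policy_pred)
    value_loss = (value_target - value_pred) ** 2
    opponent_loss = cross_entropy(opponent_target, opponent_pred)
    return policy_loss + value_loss + aux_weight * opponent_loss


def node_prior(is_opponent_node, apprentice_prior, opponent_model_prior):
    """Pick the action prior for a search node.

    At opponent nodes the prior comes from the opponent model (given or
    learnt), which biases the search towards moves that opponent would play.
    """
    return opponent_model_prior if is_opponent_node else apprentice_prior


def puct_scores(q_values, visit_counts, prior, c_puct=1.0):
    """PUCT-style score per action: Q + c * P * sqrt(sum N) / (1 + N)."""
    total_visits = sum(visit_counts)
    return [q + c_puct * p * math.sqrt(total_visits) / (1 + n)
            for q, n, p in zip(q_values, visit_counts, prior)]
```

In this sketch, calling `node_prior(True, apprentice_prior, opponent_model_prior)` and passing the result to `puct_scores` makes the search at an opponent node prefer the actions the opponent model considers likely, so the resulting search statistics reflect play against that particular opponent.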