Paper Title
Improving Bidding and Playing Strategies in the Trick-Taking game Wizard using Deep Q-Networks
Paper Authors
Paper Abstract
In this work, the trick-taking game Wizard, with its separate bidding and playing phases, is modeled as two interleaved partially observable Markov decision processes (POMDPs). Deep Q-Networks (DQN) are used to train self-improving agents capable of tackling the challenges of a highly non-stationary environment. To compare the algorithms with each other, the accuracy of matching the bid to the trick count is monitored; it correlates strongly with the actual rewards and provides well-defined upper and lower performance bounds. The trained DQN agents achieve accuracies between 66% and 87% in self-play, leaving behind both a random baseline and a rule-based heuristic. The conducted analysis also reveals a strong information asymmetry with respect to player position during bidding. To compensate for the missing Markov property of imperfect-information games, a long short-term memory (LSTM) network is implemented to integrate historical information into the decision-making process. Additionally, a forward-directed tree search is conducted by sampling a state of the environment, thereby turning the game into a perfect-information setting. To our surprise, neither approach surpasses the performance of the basic DQN agent.
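As a concrete illustration of the evaluation metric described in the abstract, the sketch below computes the fraction of bids that exactly match the number of tricks actually won. It is a minimal, hypothetical reconstruction: the function name `bid_accuracy` and the (rounds × players) array layout are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def bid_accuracy(bids, tricks_won):
    """Fraction of (round, player) entries where the bid exactly equals
    the number of tricks won -- the accuracy metric tracked in the abstract.

    Both inputs are assumed to be arrays of shape (num_rounds, num_players);
    this layout is illustrative, not taken from the paper's code.
    """
    bids = np.asarray(bids)
    tricks_won = np.asarray(tricks_won)
    return float(np.mean(bids == tricks_won))

# Hypothetical usage: three rounds of a four-player game.
bids       = [[1, 0, 2, 1],
              [0, 1, 1, 1],
              [2, 0, 0, 1]]
tricks_won = [[1, 0, 1, 2],
              [0, 1, 1, 1],
              [2, 1, 0, 0]]
print(bid_accuracy(bids, tricks_won))  # 0.666..., i.e. 8 of 12 bids hit exactly
```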