Paper Title
Difference Rewards Policy Gradients
Paper Authors
Paper Abstract
Policy gradient methods have become one of the most popular classes of algorithms for multi-agent reinforcement learning. A key challenge, however, that is not addressed by many of these methods is multi-agent credit assignment: assessing an agent's contribution to the overall performance, which is crucial for learning good policies. We propose a novel algorithm called Dr.Reinforce that explicitly tackles this by combining difference rewards with policy gradients to allow for learning decentralized policies when the reward function is known. By differencing the reward function directly, Dr.Reinforce avoids difficulties associated with learning the Q-function as done by Counterfactual Multiagent Policy Gradients (COMA), a state-of-the-art difference rewards method. For applications where the reward function is unknown, we show the effectiveness of a version of Dr.Reinforce that learns an additional reward network that is used to estimate the difference rewards.
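For context, the difference-rewards mechanism the abstract refers to can be sketched as follows; this is a minimal formulation assuming the standard multi-agent setting, and the default action c^i and per-agent parameters \theta_i are notation introduced here for illustration, not taken from the abstract. Agent i's contribution is isolated by differencing the shared reward against a counterfactual in which its action is replaced by a default action, and this differenced signal stands in for the raw return in a REINFORCE-style gradient:

\[
D^i(s, \mathbf{a}) \;=\; r(s, \mathbf{a}) \;-\; r\big(s, (\mathbf{a}^{-i}, c^i)\big)
\]
\[
\nabla_{\theta_i} J \;\approx\; \mathbb{E}\Big[ \sum_{t} \nabla_{\theta_i} \log \pi_{\theta_i}(a^i_t \mid s_t) \sum_{l \ge t} \gamma^{\,l-t}\, D^i(s_l, \mathbf{a}_l) \Big]
\]

When the reward function is unknown, the variant mentioned in the abstract would presumably estimate the counterfactual term r(s, (\mathbf{a}^{-i}, c^i)) with the learned reward network, so that D^i is computed from the network's predictions rather than from the true reward function.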