Paper Title
Intrinsic fluctuations of reinforcement learning promote cooperation
Paper Authors
Paper Abstract
In this work, we ask and answer what makes classical temporal-difference reinforcement learning with epsilon-greedy strategies cooperative. Cooperating in social dilemma situations is vital for animals, humans, and machines. While evolutionary theory has revealed a range of mechanisms promoting cooperation, the conditions under which agents learn to cooperate are contested. Here, we demonstrate which individual elements of the multi-agent learning setting lead to cooperation, and how. We use the iterated Prisoner's Dilemma with one-period memory as a testbed. Each of the two learning agents learns a strategy that conditions its next action choice on both agents' action choices in the previous round. We find that, alongside a strong concern for future rewards (a high discount factor), a low exploration rate, and a small learning rate, it is primarily the intrinsic stochastic fluctuations of the reinforcement learning process that double the final rate of cooperation to up to 80%. Inherent noise is thus not a necessary evil of the iterative learning process; it is a critical asset for learning cooperation. However, we also point out a trade-off between a high likelihood of cooperative behavior and achieving it in a reasonable amount of time. Our findings are relevant for purposefully designing cooperative algorithms and for regulating undesired collusive effects.
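As a concrete illustration of the setup the abstract describes, here is a minimal sketch of two tabular temporal-difference (Q-)learners with epsilon-greedy exploration playing the iterated Prisoner's Dilemma with one-period memory. The payoff values, hyperparameters, and run length below are illustrative assumptions, not the paper's exact settings.

```python
import random

ACTIONS = (0, 1)  # 0 = cooperate, 1 = defect

# Illustrative Prisoner's Dilemma payoffs with T > R > P > S;
# the paper's exact values may differ.
PAYOFF = {
    (0, 0): 3.0,  # reward R: mutual cooperation
    (0, 1): 0.0,  # sucker S: I cooperate, opponent defects
    (1, 0): 5.0,  # temptation T: I defect, opponent cooperates
    (1, 1): 1.0,  # punishment P: mutual defection
}

# Hyperparameters: the abstract points to a high discount factor, a low
# exploration rate, and a small learning rate; these numbers are assumptions.
ALPHA = 0.05    # learning rate
GAMMA = 0.99    # discount factor ("caring for future rewards")
EPSILON = 0.01  # epsilon-greedy exploration rate


class QAgent:
    """Tabular TD (Q-)learner; the state is both agents' last-round actions."""

    def __init__(self):
        # Q-value for each (state, action); a state is (own, opponent) last action.
        self.q = {(s, a): 0.0
                  for s in ((x, y) for x in ACTIONS for y in ACTIONS)
                  for a in ACTIONS}

    def act(self, state):
        if random.random() < EPSILON:
            return random.choice(ACTIONS)                       # explore
        return max(ACTIONS, key=lambda a: self.q[(state, a)])   # exploit

    def update(self, state, action, reward, next_state):
        # One-step Q-learning (temporal-difference) update.
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        td_error = reward + GAMMA * best_next - self.q[(state, action)]
        self.q[(state, action)] += ALPHA * td_error


def run(steps=200_000, seed=0):
    random.seed(seed)
    agent1, agent2 = QAgent(), QAgent()
    a1, a2 = random.choice(ACTIONS), random.choice(ACTIONS)  # random first round
    cooperations = 0
    for _ in range(steps):
        s1, s2 = (a1, a2), (a2, a1)  # each agent sees its own action first
        n1, n2 = agent1.act(s1), agent2.act(s2)
        agent1.update(s1, n1, PAYOFF[(n1, n2)], (n1, n2))
        agent2.update(s2, n2, PAYOFF[(n2, n1)], (n2, n1))
        a1, a2 = n1, n2
        cooperations += (a1 == 0) + (a2 == 0)
    print(f"average cooperation rate: {cooperations / (2 * steps):.1%}")


if __name__ == "__main__":
    run()
```

Running this with different seeds yields different learned strategies and cooperation rates; that run-to-run variability, driven by epsilon-greedy exploration and the stochastic learning updates, is the kind of intrinsic fluctuation the abstract identifies as the main driver of cooperation.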