Paper Title

Generalization, Mayhems and Limits in Recurrent Proximal Policy Optimization

Paper Authors

Marco Pleines, Matthias Pallasch, Frank Zimmer, Mike Preuss

Paper Abstract

At first sight it may seem straightforward to use recurrent layers in Deep Reinforcement Learning algorithms to enable agents to make use of memory in the setting of partially observable environments. Starting from widely used Proximal Policy Optimization (PPO), we highlight vital details that one must get right when adding recurrence to achieve a correct and efficient implementation, namely: properly shaping the neural net's forward pass, arranging the training data, correspondingly selecting hidden states for sequence beginnings and masking paddings for loss computation. We further explore the limitations of recurrent PPO by benchmarking the contributed novel environments Mortar Mayhem and Searing Spotlights that challenge the agent's memory beyond solely capacity and distraction tasks. Remarkably, we can demonstrate a transition to strong generalization in Mortar Mayhem when scaling the number of training seeds, while the agent does not succeed on Searing Spotlights, which seems to be a tough challenge for memory-based agents.
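The loss-masking detail mentioned in the abstract is easy to get wrong when episodes are cut into fixed-length, zero-padded sequences. Below is a minimal illustrative sketch of masking padded timesteps out of the per-step PPO surrogate loss; it is not the authors' implementation, and the tensor shapes, the example sequence lengths, and the helper name masked_mean are assumptions for illustration only.

```python
import torch

def masked_mean(values: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average per-timestep values over real (unpadded) steps only."""
    return (values * mask).sum() / mask.sum().clamp(min=1.0)

# Toy batch: 4 sequences zero-padded to length 8; pad_mask marks real timesteps.
num_sequences, sequence_length = 4, 8
surrogate_loss = torch.randn(num_sequences, sequence_length)  # hypothetical per-step surrogate values
pad_mask = torch.zeros(num_sequences, sequence_length)
for i, true_len in enumerate([8, 5, 3, 6]):  # assumed true episode-chunk lengths
    pad_mask[i, :true_len] = 1.0

# A plain mean would let the zero-padded steps dilute the loss;
# the masked mean averages only over real environment steps.
policy_loss = masked_mean(surrogate_loss, pad_mask)
print(policy_loss.item())
```

The same masking idea would apply to the value and entropy terms, so that padded steps contribute nothing to any part of the objective.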
