Paper Title
A2C is a special case of PPO
Paper Authors
Paper Abstract
Advantage Actor-Critic (A2C) and Proximal Policy Optimization (PPO) are popular deep reinforcement learning algorithms used for game AI in recent years. A common understanding is that A2C and PPO are separate algorithms, because PPO's clipped objective appears significantly different from A2C's objective. In this paper, however, we show that A2C is a special case of PPO. We present theoretical justifications and a pseudocode analysis to demonstrate why. To validate our claim, we conduct an empirical experiment using \texttt{Stable-baselines3}, showing that A2C and PPO produce the \textit{exact} same models when other settings are controlled.
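The core of the theoretical argument can be sketched in the standard PPO notation of Schulman et al. (2017); the symbols below ($r_t$, $\hat{A}_t$, $\epsilon$, $\theta_{\mathrm{old}}$) are assumed from that paper rather than quoted from this one. PPO maximizes the clipped surrogate

\[ L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\left( r_t(\theta)\hat{A}_t,\; \mathrm{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\hat{A}_t \right) \right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}. \]

When each rollout is consumed in a single gradient update (one epoch, one minibatch), $\theta = \theta_{\mathrm{old}}$ at update time, so $r_t(\theta) = 1$ lies strictly inside $[1-\epsilon,\, 1+\epsilon]$ and the clip is inactive. Differentiating at $\theta = \theta_{\mathrm{old}}$ then gives

\[ \nabla_\theta L^{\mathrm{CLIP}}(\theta)\Big|_{\theta = \theta_{\mathrm{old}}} = \hat{\mathbb{E}}_t\!\left[ \hat{A}_t \,\nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]\Big|_{\theta = \theta_{\mathrm{old}}}, \]

which is exactly the A2C policy-gradient estimator.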
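As an illustration of the controlled-settings experiment the abstract describes, the following is a minimal, hypothetical sketch of configuring \texttt{Stable-baselines3}'s PPO so that its update coincides with A2C's. The hyperparameter values mirror \texttt{Stable-baselines3}'s A2C defaults and are assumptions made for illustration, not the paper's exact script.

import torch
from stable_baselines3 import A2C, PPO

a2c = A2C("MlpPolicy", "CartPole-v1", seed=1)

# PPO configured to reduce to A2C (values mirror SB3's A2C defaults; assumptions).
ppo_as_a2c = PPO(
    "MlpPolicy",
    "CartPole-v1",
    seed=1,
    n_steps=5,                  # A2C's short rollout length
    n_epochs=1,                 # one update per rollout => r_t = 1, clip never triggers
    batch_size=5,               # one full-batch gradient step (n_steps * n_envs)
    gae_lambda=1.0,             # A2C default: plain returns, no GAE smoothing
    normalize_advantage=False,  # A2C does not normalize advantages by default
    ent_coef=0.0,
    vf_coef=0.5,
    learning_rate=7e-4,         # A2C's default learning rate
    max_grad_norm=0.5,
    # Match A2C's RMSprop optimizer in place of PPO's default Adam.
    policy_kwargs=dict(
        optimizer_class=torch.optim.RMSprop,
        optimizer_kwargs=dict(alpha=0.99, eps=1e-5, weight_decay=0),
    ),
)

# With identical seeds and rollouts, the two learners should then take the
# same sequence of gradient steps, which is the paper's empirical claim.
a2c.learn(total_timesteps=1000)
ppo_as_a2c.learn(total_timesteps=1000)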