Paper Title
A2C is a special case of PPO
Paper Authors
Paper Abstract
Advantage Actor-Critic (A2C) and Proximal Policy Optimization (PPO) are popular deep reinforcement learning algorithms used for game AI in recent years. A common understanding is that A2C and PPO are separate algorithms, because PPO's clipped objective appears significantly different from A2C's objective. In this paper, however, we show that A2C is a special case of PPO. We present theoretical justifications and a pseudocode analysis to demonstrate why. To validate our claim, we conduct an empirical experiment using \texttt{Stable-baselines3}, showing that A2C and PPO produce the \textit{exact} same models when other settings are controlled.
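The core of the theoretical argument can be sketched in the standard PPO notation of Schulman et al. (2017); the symbols below ($r_t$, $\hat{A}_t$, $\epsilon$, $\theta_{\mathrm{old}}$) are assumed from that paper rather than quoted from this one. PPO maximizes the clipped surrogate

\[ L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\left( r_t(\theta)\hat{A}_t,\; \mathrm{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\hat{A}_t \right) \right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}. \]

When each rollout is consumed in a single gradient update (one epoch, one minibatch), $\theta = \theta_{\mathrm{old}}$ at update time, so $r_t(\theta) = 1$ lies strictly inside $[1-\epsilon,\, 1+\epsilon]$ and the clip is inactive. Differentiating at $\theta = \theta_{\mathrm{old}}$ then gives

\[ \nabla_\theta L^{\mathrm{CLIP}}(\theta)\Big|_{\theta = \theta_{\mathrm{old}}} = \hat{\mathbb{E}}_t\!\left[ \hat{A}_t \,\nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]\Big|_{\theta = \theta_{\mathrm{old}}}, \]

which is exactly the A2C policy-gradient estimator.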
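As an illustration of the controlled-settings experiment the abstract describes, the following is a minimal, hypothetical sketch of configuring \texttt{Stable-baselines3}'s PPO so that its update coincides with A2C's. The hyperparameter values mirror \texttt{Stable-baselines3}'s A2C defaults and are assumptions made for illustration, not the paper's exact script.

import torch
from stable_baselines3 import A2C, PPO

a2c = A2C("MlpPolicy", "CartPole-v1", seed=1)

# PPO configured to reduce to A2C (values mirror SB3's A2C defaults; assumptions).
ppo_as_a2c = PPO(
    "MlpPolicy",
    "CartPole-v1",
    seed=1,
    n_steps=5,                  # A2C's short rollout length
    n_epochs=1,                 # one update per rollout => r_t = 1, clip never triggers
    batch_size=5,               # one full-batch gradient step (n_steps * n_envs)
    gae_lambda=1.0,             # A2C default: plain returns, no GAE smoothing
    normalize_advantage=False,  # A2C does not normalize advantages by default
    ent_coef=0.0,
    vf_coef=0.5,
    learning_rate=7e-4,         # A2C's default learning rate
    max_grad_norm=0.5,
    # Match A2C's RMSprop optimizer in place of PPO's default Adam.
    policy_kwargs=dict(
        optimizer_class=torch.optim.RMSprop,
        optimizer_kwargs=dict(alpha=0.99, eps=1e-5, weight_decay=0),
    ),
)

# With identical seeds and rollouts, the two learners should then take the
# same sequence of gradient steps, which is the paper's empirical claim.
a2c.learn(total_timesteps=1000)
ppo_as_a2c.learn(total_timesteps=1000)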