Paper Title

Implicit Distributional Reinforcement Learning

Paper Authors

Yuguang Yue, Zhendong Wang, Mingyuan Zhou

Paper Abstract

To improve the sample efficiency of policy-gradient based reinforcement learning algorithms, we propose implicit distributional actor-critic (IDAC) that consists of a distributional critic, built on two deep generator networks (DGNs), and a semi-implicit actor (SIA), powered by a flexible policy distribution. We adopt a distributional perspective on the discounted cumulative return and model it with a state-action-dependent implicit distribution, which is approximated by the DGNs that take state-action pairs and random noises as their input. Moreover, we use the SIA to provide a semi-implicit policy distribution, which mixes the policy parameters with a reparameterizable distribution that is not constrained by an analytic density function. In this way, the policy's marginal distribution is implicit, providing the potential to model complex properties such as covariance structure and skewness, but its parameter and entropy can still be estimated. We incorporate these features with an off-policy algorithm framework to solve problems with continuous action space and compare IDAC with state-of-the-art algorithms on representative OpenAI Gym environments. We observe that IDAC outperforms these baselines in most tasks. Python code is provided.
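The abstract describes two concrete components: a deep generator network (DGN) critic that maps a state-action pair plus random noise to a sample of the discounted return, and a semi-implicit actor (SIA) whose marginal action distribution becomes implicit once its injected noise is integrated out. The following is a minimal, hypothetical PyTorch sketch of these two modules; the layer widths, the Gaussian conditional inside the actor, and all class and parameter names are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a DGN critic and a semi-implicit actor (illustrative only;
# not the authors' released code). Assumes a PyTorch-style setup.
import torch
import torch.nn as nn


class DeepGeneratorNetwork(nn.Module):
    """Implicit distributional critic: maps (state, action, noise) to one
    sample of the discounted cumulative return G(s, a)."""

    def __init__(self, state_dim, action_dim, noise_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action, noise):
        # Different noise draws for the same (state, action) give different
        # return samples, which together approximate the return distribution.
        return self.net(torch.cat([state, action, noise], dim=-1))


class SemiImplicitActor(nn.Module):
    """Semi-implicit actor: the action is drawn from a reparameterizable
    conditional (here a Gaussian, an assumed choice) given (state, noise),
    so the marginal policy over actions is implicit."""

    def __init__(self, state_dim, action_dim, noise_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim + noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, state, noise):
        h = self.trunk(torch.cat([state, noise], dim=-1))
        mean = self.mean(h)
        log_std = self.log_std(h).clamp(-20.0, 2.0)
        dist = torch.distributions.Normal(mean, log_std.exp())
        # rsample() keeps the pathwise (reparameterization) gradient.
        return torch.tanh(dist.rsample())
```

As a usage sketch, one would draw several noise vectors per state to obtain multiple return samples from the DGN critic (e.g., for a quantile-style distributional loss) and multiple candidate actions from the SIA; the exact losses and the use of two critics for overestimation control are described in the paper itself, not in this snippet.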
