Paper Title
Policy Gradient for Reinforcement Learning with General Utilities
Paper Authors
Paper Abstract
In Reinforcement Learning (RL), the goal of the agent is to discover an optimal policy that maximizes the expected cumulative reward. This objective may also be viewed as finding a policy that optimizes a linear function of its state-action occupancy measure, hereafter referred to as Linear RL. However, many supervised and unsupervised RL problems are not covered by the Linear RL framework, such as apprenticeship learning, pure exploration, and variational intrinsic control, where the objectives are non-linear functions of the occupancy measures. RL with non-linear utilities looks unwieldy, as methods such as the Bellman equation, value iteration, policy gradients, and dynamic programming, which have had tremendous success in Linear RL, fail to generalize trivially. In this paper, we derive the policy gradient theorem for RL with general utilities. The policy gradient theorem has proven to be a cornerstone of Linear RL due to its elegance and ease of implementation. Our policy gradient theorem for RL with general utilities shares the same elegance and ease of implementation. Based on the derived policy gradient theorem, we also present a simple sample-based algorithm. We believe our results will be of interest to the community and will offer inspiration for future work in this generalized setting.
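
For context, here is a minimal sketch of the objects the abstract refers to, using notation assumed by this note rather than taken from the paper: with a discounted state-action occupancy measure lambda^pi and a differentiable utility F, the general-utility objective and the chain-rule form of its gradient can be written as

\[
\lambda^{\pi_\theta}(s,a) \;=\; (1-\gamma)\sum_{t\ge 0}\gamma^{t}\,\Pr\!\big(s_t=s,\,a_t=a \,\big|\, \pi_\theta\big),
\qquad
\max_{\theta}\; F\big(\lambda^{\pi_\theta}\big),
\]
\[
\nabla_\theta F\big(\lambda^{\pi_\theta}\big)
\;=\; \sum_{s,a}\frac{\partial F}{\partial \lambda(s,a)}\big(\lambda^{\pi_\theta}\big)\,
\nabla_\theta \lambda^{\pi_\theta}(s,a).
\]

The right-hand side is the classical policy gradient evaluated with the "pseudo-reward" r_theta(s,a) := dF/dlambda(s,a) held fixed at the current occupancy measure; when F is linear, F(lambda) = <r, lambda>, this collapses to the usual policy gradient theorem of Linear RL.

The sketch below (not the paper's algorithm, but one way a sample-based method along these lines can be organized) estimates the occupancy measure from rollouts, linearizes the utility to obtain a pseudo-reward, and takes a REINFORCE-style step. The toy MDP, the softmax parameterization, the entropy utility, and all names are illustrative assumptions.

# A minimal sketch (not the paper's algorithm) of a sample-based policy gradient
# for a general utility F of the discounted state-action occupancy measure.
# Assumptions: a small tabular MDP, a softmax policy, and a differentiable F
# supplied with its gradient (here: entropy of the occupancy measure, a
# pure-exploration objective). All names below are illustrative.

import numpy as np

rng = np.random.default_rng(0)

# toy MDP with random transitions, purely for illustration
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']

def softmax_policy(theta):
    """Row-wise softmax over actions; theta has shape (n_states, n_actions)."""
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def utility_grad(occupancy):
    """dF/dlambda for F(lambda) = -sum lambda * log lambda (occupancy entropy)."""
    return -(np.log(occupancy + 1e-8) + 1.0)

def rollout(pi, horizon=100):
    """Sample one trajectory and a discounted empirical occupancy estimate."""
    occ = np.zeros((n_states, n_actions))
    traj = []
    s = rng.integers(n_states)
    for t in range(horizon):
        a = rng.choice(n_actions, p=pi[s])
        occ[s, a] += (1 - gamma) * gamma**t
        traj.append((s, a))
        s = rng.choice(n_states, p=P[s, a])
    return traj, occ

def policy_gradient_step(theta, lr=0.5, n_traj=20, horizon=100):
    pi = softmax_policy(theta)
    # 1) estimate the occupancy measure from rollouts
    trajs, occs = zip(*(rollout(pi, horizon) for _ in range(n_traj)))
    occ_hat = np.mean(occs, axis=0)
    # 2) linearize the utility: pseudo-reward r(s,a) = dF/dlambda(s,a) at occ_hat
    pseudo_r = utility_grad(occ_hat)
    # 3) REINFORCE-style update with the pseudo-reward (reward-to-go form)
    grad = np.zeros_like(theta)
    for traj in trajs:
        rewards = np.array([pseudo_r[s, a] for s, a in traj])
        discounts = gamma ** np.arange(len(traj))
        returns = np.cumsum((discounts * rewards)[::-1])[::-1] / discounts
        for t, (s, a) in enumerate(traj):
            glogpi = -pi[s]
            glogpi[a] += 1.0  # grad of log pi(a|s) w.r.t. theta[s, :] for softmax
            grad[s] += discounts[t] * returns[t] * glogpi
    return theta + lr * grad / n_traj

theta = np.zeros((n_states, n_actions))
for _ in range(50):
    theta = policy_gradient_step(theta)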