Paper Title

Occupancy Information Ratio: Infinite-Horizon, Information-Directed, Parameterized Policy Search

Paper Authors

Wesley A. Suttle, Alec Koppel, Ji Liu

Paper Abstract

In this work, we propose an information-directed objective for infinite-horizon reinforcement learning (RL), called the occupancy information ratio (OIR), inspired by the information ratio objectives used in previous information-directed sampling schemes for multi-armed bandits and Markov decision processes, as well as by recent advances in general utility RL. The OIR, the ratio of a policy's average cost to the entropy of its induced state occupancy measure, enjoys rich underlying structure and presents an objective to which scalable, model-free policy search methods naturally apply. Specifically, by leveraging connections between quasiconcave optimization and the linear programming theory for Markov decision processes, we show that the OIR problem can be transformed and solved via concave programming methods when the underlying model is known. Since model knowledge is typically lacking in practice, we lay the foundations for model-free OIR policy search methods by establishing a corresponding policy gradient theorem. Building on this result, we then derive REINFORCE- and actor-critic-style algorithms for solving the OIR problem in policy parameter space. Crucially, exploiting the hidden quasiconcavity property implied by the concave programming transformation of the OIR problem, we establish finite-time convergence of the REINFORCE-style scheme to global optimality and asymptotic convergence of the actor-critic-style scheme to (near) global optimality under suitable conditions. Finally, we experimentally illustrate the advantage of OIR-based methods over vanilla RL methods in sparse-reward settings, supporting the OIR as an alternative to existing RL objectives.
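The abstract describes the OIR only in words. A plausible formalization is sketched below; the notation is assumed here rather than taken from the paper, and the paper may, for example, include an additive constant in the denominator to keep it positive:

$$\mathrm{OIR}(\pi) = \frac{\rho(\pi)}{\mathcal{H}(d_\pi)}, \qquad \rho(\pi) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\pi\!\Big[\sum_{t=0}^{T-1} c(s_t, a_t)\Big], \qquad \mathcal{H}(d_\pi) = -\sum_{s \in \mathcal{S}} d_\pi(s) \log d_\pi(s),$$

where $d_\pi$ is the long-run state occupancy measure induced by policy $\pi$ and $c$ is the per-step cost. Minimizing the OIR favors policies that combine low average cost with broad state coverage (high occupancy entropy), which is consistent with the reported advantage in sparse-reward settings.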
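To make the REINFORCE-style idea concrete, here is a minimal Python sketch, not the authors' algorithm: it differentiates the ratio by the quotient rule, estimating the average-cost gradient with a standard score-function estimator and the occupancy-entropy gradient with a heuristic "shadow cost" of -(1 + log d(s)) computed from the empirical occupancy. The toy MDP, the finite-horizon truncation, and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tabular MDP -- an illustrative assumption, not from the paper.
S, A, T_HORIZON, N_TRAJ = 5, 3, 40, 32
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a] is a distribution over next states
C = rng.uniform(0.0, 1.0, size=(S, A))      # per-step costs c(s, a)

def policy(theta, s):
    """Softmax policy over actions in state s; theta has shape (S, A)."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def sample_trajectories(theta):
    states = np.zeros((N_TRAJ, T_HORIZON), dtype=int)
    actions = np.zeros((N_TRAJ, T_HORIZON), dtype=int)
    for i in range(N_TRAJ):
        s = rng.integers(S)
        for t in range(T_HORIZON):
            a = rng.choice(A, p=policy(theta, s))
            states[i, t], actions[i, t] = s, a
            s = rng.choice(S, p=P[s, a])
    return states, actions

def score_function_grad(theta, states, actions, step_cost):
    """REINFORCE estimate of the gradient of E[(1/T) sum_t step_cost(s_t, a_t)]."""
    grad = np.zeros_like(theta)
    for i in range(N_TRAJ):
        costs = step_cost(states[i], actions[i])
        cost_to_go = np.cumsum(costs[::-1])[::-1]  # sum_{k >= t} costs[k]
        for t in range(T_HORIZON):
            s, a = states[i, t], actions[i, t]
            grad_log_pi = -policy(theta, s)        # d/dtheta[s] log pi(a|s) = e_a - pi
            grad_log_pi[a] += 1.0
            grad[s] += grad_log_pi * cost_to_go[t]
    return grad / (N_TRAJ * T_HORIZON)

def oir_step(theta, lr=0.5, eps=1e-3):
    states, actions = sample_trajectories(theta)
    # Empirical occupancy measure (smoothed so log stays finite) and its entropy.
    d = np.bincount(states.ravel(), minlength=S).astype(float)
    d = (d + eps) / (d + eps).sum()
    H = -(d * np.log(d)).sum()
    rho = C[states, actions].mean()                # average-cost estimate
    g_rho = score_function_grad(theta, states, actions, lambda ss, aa: C[ss, aa])
    # Entropy gradient via the heuristic "shadow cost" -(1 + log d(s)).
    g_H = score_function_grad(theta, states, actions,
                              lambda ss, aa: -(1.0 + np.log(d[ss])))
    # Quotient rule: grad(rho / H) = (H * grad rho - rho * grad H) / H^2.
    theta -= lr * (H * g_rho - rho * g_H) / H**2   # descend to minimize the OIR
    return theta, rho, H

theta = np.zeros((S, A))
for _ in range(200):
    theta, rho, H = oir_step(theta)
print(f"final average cost {rho:.3f}, occupancy entropy {H:.3f}")
```

The finite-horizon truncation and the empirical occupancy estimate are crude stand-ins for the infinite-horizon quantities the paper analyzes; the paper's actor-critic variant and its convergence guarantees go well beyond this sketch.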
