在准精液时间计划可观察到的POMDP

论文标题

在准精液时间计划可观察到的POMDP

Planning in Observable POMDPs in Quasipolynomial Time

论文作者

Golowich, Noah, Moitra, Ankur, Rohatgi, Dhruv

论文摘要

部分可观察到的马尔可夫决策过程（POMDP）是强化学习中的自然和一般模型，考虑到代理人对其当前状态的不确定性。在有关POMDP的文献中，习惯性地假设访问一个已知参数时计算最佳策略的规划Oracle，即使已知问题在计算上很难。几乎所有现有的计划算法要么在指数时间内运行，缺乏可证明的性能保证，要么需要在每个可能的政策下对过渡动态进行强有力的假设。在这项工作中，我们重新审视了计划问题，并提出：是否有自然且动机良好的假设使计划变得容易？我们的主要结果是用于计划（一步）可观察的POMDP的准化性时间算法。具体而言，我们假设在状态上进行分离良好的分布会导致观察结果分离良好的分布，因此观察值在每个步骤中至少有所帮助。至关重要的是，此假设对POMDP的过渡动力学没有任何限制。然而，这意味着近乎最佳的政策承认了准确的描述，这通常不正确（在标准硬度假设下）。我们的分析基于用于滤波器稳定性的新定量界限 - 即潜在状态忘记其初始化的最佳过滤器的速率。此外，我们证明了在指数时间假设下可观察到的POMDP计划的匹配硬度。

Partially Observable Markov Decision Processes (POMDPs) are a natural and general model in reinforcement learning that take into account the agent's uncertainty about its current state. In the literature on POMDPs, it is customary to assume access to a planning oracle that computes an optimal policy when the parameters are known, even though the problem is known to be computationally hard. Almost all existing planning algorithms either run in exponential time, lack provable performance guarantees, or require placing strong assumptions on the transition dynamics under every possible policy. In this work, we revisit the planning problem and ask: are there natural and well-motivated assumptions that make planning easy? Our main result is a quasipolynomial-time algorithm for planning in (one-step) observable POMDPs. Specifically, we assume that well-separated distributions on states lead to well-separated distributions on observations, and thus the observations are at least somewhat informative in each step. Crucially, this assumption places no restrictions on the transition dynamics of the POMDP; nevertheless, it implies that near-optimal policies admit quasi-succinct descriptions, which is not true in general (under standard hardness assumptions). Our analysis is based on new quantitative bounds for filter stability -- i.e. the rate at which an optimal filter for the latent state forgets its initialization. Furthermore, we prove matching hardness for planning in observable POMDPs under the Exponential Time Hypothesis.

下载PDF全文

下载文献需遵守相关版权规定

论文标题