Paper Title
A Few Expert Queries Suffices for Sample-Efficient RL with Resets and Linear Value Approximation
Paper Authors
Paper Abstract
The current paper studies sample-efficient Reinforcement Learning (RL) in settings where only the optimal value function is assumed to be linearly-realizable. It has recently been understood that, even under this seemingly strong assumption and access to a generative model, worst-case sample complexities can be prohibitively (i.e., exponentially) large. We investigate the setting where the learner additionally has access to interactive demonstrations from an expert policy, and we present a statistically and computationally efficient algorithm (Delphi) for blending exploration with expert queries. In particular, Delphi requires $\tilde{\mathcal{O}}(d)$ expert queries and a $\texttt{poly}(d,H,|\mathcal{A}|,1/\varepsilon)$ amount of exploratory samples to provably recover an $\varepsilon$-suboptimal policy. Compared to pure RL approaches, this corresponds to an exponential improvement in sample complexity with surprisingly little expert input. Compared to prior imitation learning (IL) approaches, our required number of expert demonstrations is independent of $H$ and logarithmic in $1/\varepsilon$, whereas all prior work required at least linear factors of both in addition to the same dependence on $d$. Towards establishing the minimal amount of expert queries needed, we show that, in the same setting, any learner whose exploration budget is polynomially-bounded (in terms of $d,H,$ and $|\mathcal{A}|$) will require at least $\tilde{\Omega}(\sqrt{d})$ oracle calls to recover a policy competing with the expert's value function. Under the weaker assumption that the expert's policy is linear, we show that the lower bound increases to $\tilde{\Omega}(d)$.
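To make the interaction model concrete, below is a minimal, hypothetical sketch (not the paper's Delphi algorithm) of the setting the abstract describes: $d$-dimensional state-action features for a linear value approximation, exploratory rollouts gathered through resets, and a sparingly-used expert-query oracle. All names (ToyMDP, expert_action, collect) and the crude regression target are invented for illustration only.

```python
# Hypothetical sketch of the interaction model, NOT the Delphi algorithm:
# linear features phi(s, a) in R^d, cheap exploratory samples via resets,
# and a small budget of expensive expert queries.
import numpy as np

rng = np.random.default_rng(0)

class ToyMDP:
    """Small random MDP with d-dimensional state-action features (illustrative only)."""
    def __init__(self, n_states=20, n_actions=4, d=8, horizon=10):
        self.nS, self.nA, self.d, self.H = n_states, n_actions, d, horizon
        self.P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
        self.R = rng.uniform(size=(n_states, n_actions))
        self.phi = rng.normal(size=(n_states, n_actions, d)) / np.sqrt(d)

    def reset(self):
        # Reset access: start a fresh episode from a (random) initial state.
        return int(rng.integers(self.nS))

    def step(self, s, a):
        s_next = int(rng.choice(self.nS, p=self.P[s, a]))
        return self.R[s, a], s_next

def expert_action(mdp, s):
    """Expert-query oracle: a myopic greedy policy stands in for the expert here."""
    return int(np.argmax(mdp.R[s]))

def collect(mdp, n_episodes=200, expert_budget=8):
    """Mostly autonomous exploration; only `expert_budget` expert queries in total."""
    X, y, queries_used = [], [], 0
    for _ in range(n_episodes):
        s = mdp.reset()
        for _ in range(mdp.H):
            if queries_used < expert_budget and rng.random() < 0.05:
                a = expert_action(mdp, s)      # rare, expensive expert query
                queries_used += 1
            else:
                a = int(rng.integers(mdp.nA))  # cheap exploratory action
            r, s_next = mdp.step(s, a)
            X.append(mdp.phi[s, a])
            y.append(r)                        # crude stand-in regression target
            s = s_next
    return np.array(X), np.array(y), queries_used

mdp = ToyMDP()
X, y, used = collect(mdp)
# Fit a linear value estimate Q_hat(s, a) ~= phi(s, a) . theta by least squares.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"expert queries used: {used}, ||theta|| = {np.linalg.norm(theta):.3f}")
```

The point of the sketch is only the query accounting: exploratory samples are plentiful and cheap, while expert queries are metered against a small budget, mirroring the abstract's $\tilde{\mathcal{O}}(d)$ expert queries versus $\texttt{poly}(d,H,|\mathcal{A}|,1/\varepsilon)$ exploratory samples.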