Paper Title
A Few Expert Queries Suffices for Sample-Efficient RL with Resets and Linear Value Approximation
Paper Authors
Paper Abstract
The current paper studies sample-efficient Reinforcement Learning (RL) in settings where only the optimal value function is assumed to be linearly-realizable. It has recently been understood that, even under this seemingly strong assumption and access to a generative model, worst-case sample complexities can be prohibitively (i.e., exponentially) large. We investigate the setting where the learner additionally has access to interactive demonstrations from an expert policy, and we present a statistically and computationally efficient algorithm (Delphi) for blending exploration with expert queries. In particular, Delphi requires $\tilde{\mathcal{O}}(d)$ expert queries and a $\texttt{poly}(d,H,|\mathcal{A}|,1/\varepsilon)$ amount of exploratory samples to provably recover an $\varepsilon$-suboptimal policy. Compared to pure RL approaches, this corresponds to an exponential improvement in sample complexity with surprisingly little expert input. Compared to prior imitation learning (IL) approaches, our required number of expert demonstrations is independent of $H$ and logarithmic in $1/\varepsilon$, whereas all prior work required at least linear factors of both in addition to the same dependence on $d$. Towards establishing the minimal amount of expert queries needed, we show that, in the same setting, any learner whose exploration budget is polynomially-bounded (in terms of $d,H,$ and $|\mathcal{A}|$) will require at least $\tilde{\Omega}(\sqrt{d})$ oracle calls to recover a policy competing with the expert's value function. Under the weaker assumption that the expert's policy is linear, we show that the lower bound increases to $\tilde{\Omega}(d)$.
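To make the interaction model concrete, below is a minimal, hypothetical sketch (not the paper's Delphi algorithm) of the setting the abstract describes: $d$-dimensional state-action features for a linear value approximation, exploratory rollouts gathered through resets, and a sparingly-used expert-query oracle. All names (ToyMDP, expert_action, collect) and the crude regression target are invented for illustration only.

```python
# Hypothetical sketch of the interaction model, NOT the Delphi algorithm:
# linear features phi(s, a) in R^d, cheap exploratory samples via resets,
# and a small budget of expensive expert queries.
import numpy as np

rng = np.random.default_rng(0)

class ToyMDP:
    """Small random MDP with d-dimensional state-action features (illustrative only)."""
    def __init__(self, n_states=20, n_actions=4, d=8, horizon=10):
        self.nS, self.nA, self.d, self.H = n_states, n_actions, d, horizon
        self.P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
        self.R = rng.uniform(size=(n_states, n_actions))
        self.phi = rng.normal(size=(n_states, n_actions, d)) / np.sqrt(d)

    def reset(self):
        # Reset access: start a fresh episode from a (random) initial state.
        return int(rng.integers(self.nS))

    def step(self, s, a):
        s_next = int(rng.choice(self.nS, p=self.P[s, a]))
        return self.R[s, a], s_next

def expert_action(mdp, s):
    """Expert-query oracle: a myopic greedy policy stands in for the expert here."""
    return int(np.argmax(mdp.R[s]))

def collect(mdp, n_episodes=200, expert_budget=8):
    """Mostly autonomous exploration; only `expert_budget` expert queries in total."""
    X, y, queries_used = [], [], 0
    for _ in range(n_episodes):
        s = mdp.reset()
        for _ in range(mdp.H):
            if queries_used < expert_budget and rng.random() < 0.05:
                a = expert_action(mdp, s)      # rare, expensive expert query
                queries_used += 1
            else:
                a = int(rng.integers(mdp.nA))  # cheap exploratory action
            r, s_next = mdp.step(s, a)
            X.append(mdp.phi[s, a])
            y.append(r)                        # crude stand-in regression target
            s = s_next
    return np.array(X), np.array(y), queries_used

mdp = ToyMDP()
X, y, used = collect(mdp)
# Fit a linear value estimate Q_hat(s, a) ~= phi(s, a) . theta by least squares.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"expert queries used: {used}, ||theta|| = {np.linalg.norm(theta):.3f}")
```

The point of the sketch is only the query accounting: exploratory samples are plentiful and cheap, while expert queries are metered against a small budget, mirroring the abstract's $\tilde{\mathcal{O}}(d)$ expert queries versus $\texttt{poly}(d,H,|\mathcal{A}|,1/\varepsilon)$ exploratory samples.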