Provable Benefit of Multitask Representation Learning in Reinforcement Learning

Paper title

Provable Benefit of Multitask Representation Learning in Reinforcement Learning

Authors

Yuan Cheng, Songtao Feng, Jing Yang, Hong Zhang, Yingbin Liang

Abstract

As representation learning becomes a powerful technique to reduce sample complexity in reinforcement learning (RL) in practice, theoretical understanding of its advantage is still limited. In this paper, we theoretically characterize the benefit of representation learning under the low-rank Markov decision process (MDP) model. We first study multitask low-rank RL (as upstream training), where all tasks share a common representation, and propose a new multitask reward-free algorithm called REFUEL. REFUEL learns both the transition kernel and the near-optimal policy for each task, and outputs a well-learned representation for downstream tasks. Our result demonstrates that multitask representation learning is provably more sample-efficient than learning each task individually, as long as the total number of tasks is above a certain threshold. We then study the downstream RL in both online and offline settings, where the agent is assigned a new task sharing the same representation as the upstream tasks. For both online and offline settings, we develop a sample-efficient algorithm, and show that it finds a near-optimal policy with the suboptimality gap bounded by the sum of the estimation error of the learned representation in upstream and a term that vanishes as the number of downstream samples becomes large. Our downstream results for online and offline RL further capture the benefit of employing the learned representation from upstream as opposed to learning the representation of the low-rank model directly. To the best of our knowledge, this is the first theoretical study that characterizes the benefit of representation learning in exploration-based reward-free multitask RL for both upstream and downstream tasks.
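
For orientation, the low-rank MDP model mentioned in the abstract is commonly written as a factorization of the transition kernel. The block below is a sketch of that standard form, not notation taken from this page: the symbols \phi, \mu, d, H, T and the task index t are assumptions introduced for illustration.

```latex
% Assumed notation (not from the source page): horizon H, feature dimension d,
% upstream tasks t = 1, ..., T, step h = 1, ..., H.
% The feature map \phi_h is shared by all tasks; \mu_h^{(t)} is task-specific.
P_h^{(t)}(s' \mid s, a) = \big\langle \phi_h(s, a),\, \mu_h^{(t)}(s') \big\rangle,
\qquad \phi_h(s, a) \in \mathbb{R}^d,\quad \mu_h^{(t)}(s') \in \mathbb{R}^d .
```

Under this reading, "sharing a common representation" means that every upstream task, and the new downstream task, uses the same \phi_h, while each task has its own \mu_h^{(t)}.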
