Paper Title
Multi-task Reinforcement Learning in Reproducing Kernel Hilbert Spaces via Cross-learning
Authors
Abstract
Reinforcement learning (RL) is a framework for optimizing a control policy using rewards that the system reveals in response to control actions. In its standard form, RL involves a single agent that uses its policy to accomplish a specific task. These methods require large amounts of reward samples to achieve good performance, and they may not generalize well when the task is modified, even if the new task is related. In this paper we are interested in a collaborative scheme in which multiple agents with different tasks optimize their policies jointly. To this end, we introduce cross-learning, in which agents tackling related tasks have their policies constrained to be close to one another. Two properties make our new approach attractive: (i) it produces a multi-task central policy that can be used as a starting point for adapting quickly to any of the tasks trained for, in situations where the agent does not know which task it is currently facing, and (ii) as in meta-learning, it adapts to environments related to, but different from, those seen during training. We focus on continuous policies belonging to reproducing kernel Hilbert spaces, for which we bound the distance between the task-specific policies and the cross-learned policy. To solve the resulting optimization problem, we resort to a projected policy gradient algorithm and prove that it converges to a near-optimal solution with high probability. We evaluate our methodology on a navigation example in which agents move through environments with obstacles of multiple shapes and avoid obstacles not trained for.
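The following is a minimal numerical sketch of the projected policy gradient idea described in the abstract, under simplifying assumptions: policies are represented as plain parameter vectors rather than RKHS functions, the per-task policy gradients are supplied by the caller, the central policy is taken as the mean of the task policies, and the proximity constraint is enforced by a Euclidean projection onto a ball of radius eps around that central policy. The function name `cross_learning_step` and the parameter `eps` are illustrative, not taken from the paper.

```python
# Sketch of one cross-learning update: ascend each task's objective,
# then project every task-specific policy back toward a central policy.
import numpy as np

def cross_learning_step(thetas, grads, eps, lr=1e-2):
    """One projected policy-gradient step for all tasks.

    thetas: (num_tasks, dim) current task-specific policy parameters
    grads:  (num_tasks, dim) stochastic policy gradients, one per task
    eps:    maximum allowed distance from the central (cross-learned) policy
    lr:     step size
    """
    # Unconstrained ascent step on each task's expected return.
    thetas = thetas + lr * grads

    # Central policy: here simply the mean of the task policies
    # (an assumption of this sketch, not necessarily the paper's choice).
    theta_bar = thetas.mean(axis=0)

    # Project each task policy into the ball of radius eps around theta_bar,
    # keeping the policies constrained to be close to one another.
    for i in range(len(thetas)):
        dev = thetas[i] - theta_bar
        norm = np.linalg.norm(dev)
        if norm > eps:
            thetas[i] = theta_bar + eps * dev / norm
    return thetas, theta_bar
```

In this reading, theta_bar plays the role of the multi-task central policy mentioned in property (i): it can be used directly when the agent does not know which task it faces, or as a warm start for fast adaptation to a new, related task.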