Paper Title

Metric Residual Networks for Sample Efficient Goal-Conditioned Reinforcement Learning

Authors

Bo Liu, Yihao Feng, Qiang Liu, Peter Stone

Abstract

Goal-conditioned reinforcement learning (GCRL) has a wide range of potential real-world applications, including manipulation and navigation problems in robotics. Especially in such robotics tasks, sample efficiency is of the utmost importance for GCRL since, by default, the agent is only rewarded when it reaches its goal. While several methods have been proposed to improve the sample efficiency of GCRL, one relatively under-studied approach is the design of neural architectures to support sample efficiency. In this work, we introduce a novel neural architecture for GCRL that achieves significantly better sample efficiency than the commonly-used monolithic network architecture. The key insight is that the optimal action-value function Q^*(s, a, g) must satisfy the triangle inequality in a specific sense. Furthermore, we introduce the metric residual network (MRN) that deliberately decomposes the action-value function Q(s,a,g) into the negated summation of a metric plus a residual asymmetric component. MRN provably approximates any optimal action-value function Q^*(s,a,g), thus making it a fitting neural architecture for GCRL. We conduct comprehensive experiments across 12 standard benchmark environments in GCRL. The empirical results demonstrate that MRN uniformly outperforms other state-of-the-art GCRL neural architectures in terms of sample efficiency.
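The abstract's two key points can be made concrete. First, the triangle-inequality insight: with the standard sparse reward (the agent is rewarded only on reaching the goal), -Q^*(s, a, g) behaves like a shortest-path cost from (s, a) to g, and a shortest path can never be beaten by detouring through an intermediate waypoint, which is exactly a triangle inequality; a monolithic MLP encodes no such inductive bias. Second, the architecture: below is a minimal PyTorch sketch of the negated metric-plus-residual decomposition of Q(s, a, g). The encoder shapes, the Euclidean form of the symmetric term, and the max-over-coordinates form of the asymmetric residual are illustrative assumptions for this sketch, not necessarily the paper's exact design.

```python
import torch
import torch.nn as nn

class MRNHead(nn.Module):
    """Hypothetical sketch of a metric residual network critic:
    Q(s, a, g) = -(d_sym + d_asym), where d_sym is a symmetric metric
    between learned embeddings and d_asym is an asymmetric residual.
    Hidden/embedding sizes are illustrative, not the paper's values."""

    def __init__(self, state_dim, action_dim, goal_dim, hidden=256, embed=16):
        super().__init__()

        def encoder(in_dim):
            return nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, embed),
            )

        # Separate encoders for the metric branch (phi) and the
        # asymmetric residual branch (psi).
        self.phi_sa = encoder(state_dim + action_dim)
        self.phi_g = encoder(goal_dim)
        self.psi_sa = encoder(state_dim + action_dim)
        self.psi_g = encoder(goal_dim)

    def forward(self, s, a, g):
        sa = torch.cat([s, a], dim=-1)
        # Symmetric component: a Euclidean metric between embeddings,
        # so d_sym(x, y) == d_sym(y, x) and the triangle inequality holds.
        d_sym = torch.norm(self.phi_sa(sa) - self.phi_g(g), dim=-1)
        # Asymmetric residual: max over embedding coordinates. This form
        # still satisfies the triangle inequality but is not symmetric.
        d_asym = torch.max(self.psi_sa(sa) - self.psi_g(g), dim=-1).values
        # Q is the negated sum of the two distance-like components.
        return -(d_sym + d_asym)

# Usage on a batch of 32 transitions (dimensions are arbitrary examples).
q_net = MRNHead(state_dim=10, action_dim=4, goal_dim=3)
q_values = q_net(torch.randn(32, 10), torch.randn(32, 4), torch.randn(32, 3))
print(q_values.shape)  # torch.Size([32])
```

The max-over-coordinates residual is what lets the head represent asymmetric reachability (reaching g from s may be much harder than the reverse) while preserving metric-like structure; a head of this shape can stand in for a monolithic Q-network while the rest of a standard GCRL training loop stays unchanged.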
