Paper Title

Strategy Discovery and Mixture in Lifelong Learning from Heterogeneous Demonstration

Paper Authors

Sravan Jayanthi, Letian Chen, Matthew Gombolay

Paper Abstract

Learning from Demonstration (LfD) approaches empower end-users to teach robots novel tasks via demonstrations of the desired behaviors, democratizing access to robotics. A key challenge in LfD research is that users tend to provide heterogeneous demonstrations for the same task due to various strategies and preferences. Therefore, it is essential to develop LfD algorithms that ensure flexibility (the robot adapts to personalized strategies), efficiency (the robot achieves sample-efficient adaptation), and scalability (the robot reuses a concise set of strategies to represent a large set of behaviors). In this paper, we propose a novel algorithm, Dynamic Multi-Strategy Reward Distillation (DMSRD), which distills common knowledge between heterogeneous demonstrations, leverages learned strategies to construct mixture policies, and continues to improve by learning from all available data. Our personalized, federated, and lifelong LfD architecture surpasses benchmarks in two continuous control problems, with an average 77% improvement in policy returns and a 42% improvement in log likelihood, alongside stronger task reward correlation and more precise strategy rewards.
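The abstract does not spell out how a mixture policy is formed from previously learned strategies, so the following is only a minimal Python sketch of one plausible mechanism: sample a strategy component with probability proportional to its mixture weight, then sample an action from that component. All names here (GaussianPolicy, mixture_action) and the fixed example weights are hypothetical stand-ins; in DMSRD the weights would be optimized against the new demonstration, which is not reproduced here.

```python
# Minimal sketch (not the authors' implementation): forming a mixture policy
# as a weighted combination of previously learned strategy policies.
import numpy as np

class GaussianPolicy:
    """Stand-in strategy policy: a state-conditioned Gaussian over actions."""
    def __init__(self, weight, bias, std=0.1):
        self.weight, self.bias, self.std = weight, bias, std

    def mean(self, state):
        return self.weight * state + self.bias

    def sample(self, state, rng):
        return rng.normal(self.mean(state), self.std)

def mixture_action(policies, weights, state, rng):
    """Sample from a weighted mixture of strategy policies:
    pick component i with probability weights[i], then sample its action."""
    i = rng.choice(len(policies), p=weights)
    return policies[i].sample(state, rng)

rng = np.random.default_rng(0)
policies = [GaussianPolicy(0.5, 0.0), GaussianPolicy(-0.3, 1.0)]
weights = [0.7, 0.3]  # hypothetical weights for a new demonstrator
print(mixture_action(policies, weights, state=2.0, rng=rng))
```

Component-wise sampling is the standard way to draw from a mixture distribution; reusing a small set of strategy components in this fashion is one way to read the scalability claim, since new demonstrators can be represented by weights alone rather than by new policies.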
