Paper Title

Human-in-the-Loop Methods for Data-Driven and Reinforcement Learning Systems

Paper Author

Goecks, Vinicius G.

Paper Abstract

Recent successes combine reinforcement learning algorithms and deep neural networks; nevertheless, reinforcement learning has not been widely applied to robotics and real-world scenarios. This can be attributed to the fact that current state-of-the-art, end-to-end reinforcement learning approaches still require thousands or millions of data samples to converge to a satisfactory policy and are subject to catastrophic failures during training. Conversely, in real-world scenarios and after just a few data samples, humans are able to either provide demonstrations of the task, intervene to prevent catastrophic actions, or simply evaluate whether the policy is performing correctly. This research investigates how to integrate these human interaction modalities into the reinforcement learning loop, increasing sample efficiency and enabling real-time reinforcement learning in robotics and real-world scenarios. This novel theoretical foundation is called the Cycle-of-Learning, a reference to how different human interaction modalities, namely task demonstration, intervention, and evaluation, are cycled and combined with reinforcement learning algorithms. Results presented in this work show that the reward signal learned from human interaction accelerates the rate of learning of reinforcement learning algorithms, and that learning from a combination of human demonstrations and interventions is faster and more sample-efficient when compared to traditional supervised learning algorithms. Finally, the Cycle-of-Learning develops an effective transition from policies learned using human demonstrations and interventions to reinforcement learning. The theoretical foundation developed by this research opens new research paths for human-agent teaming scenarios where autonomous agents are able to learn from human teammates and adapt to mission performance metrics in real time and in real-world scenarios.
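The abstract describes the Cycle-of-Learning only at a high level. The sketch below is a minimal, illustrative outline (not the dissertation's implementation) of how the three interaction modalities it names could be cycled into a reinforcement learning loop: demonstrations and interventions provide supervision for the policy, evaluations fit a reward model, and reinforcement learning then continues on that learned reward. Every class, method, and constant here (ScriptedHuman, Policy, RewardModel, rollout, cycle_of_learning) is a hypothetical placeholder standing in for a real policy network, reward model, environment, and human teammate.

```python
# Minimal, illustrative sketch of a Cycle-of-Learning-style loop.
# All names and numbers are placeholders, not the dissertation's code.

import random


class ScriptedHuman:
    """Stand-in for a human teammate in a toy 'drive the state toward zero' task."""

    def demonstrate(self, steps=20):
        # Task demonstration: produce expert-like state-action pairs.
        obs, states, actions = 2.0, [], []
        for _ in range(steps):
            action = -1 if obs > 0 else 1
            states.append(obs)
            actions.append(action)
            obs += 0.1 * action
        return states, actions

    def would_intervene(self, obs, action):
        # Intervene when the agent's action would push the state out of bounds.
        return abs(obs + 0.1 * action) > 3.0

    def corrective_action(self, obs):
        return -1 if obs > 0 else 1

    def evaluate(self, obs):
        # Evaluation: higher score the closer the state is to the goal.
        return -abs(obs)


class Policy:
    def act(self, obs):
        return random.choice([-1, 0, 1])  # placeholder for a neural-network policy

    def supervised_update(self, states, actions):
        pass  # behavioral-cloning-style fit (omitted)

    def rl_update(self, states, actions, reward_fn):
        pass  # policy-gradient-style update on the learned reward (omitted)


class RewardModel:
    def fit(self, states, scores):
        pass  # regress human evaluation scores onto states (omitted)

    def __call__(self, obs):
        return 0.0  # learned reward estimate


def rollout(policy, human=None, steps=20):
    """Roll out one episode; a human, if present, may override unsafe actions."""
    obs, states, actions = 2.0, [], []
    for _ in range(steps):
        action = policy.act(obs)
        if human and human.would_intervene(obs, action):
            action = human.corrective_action(obs)
        states.append(obs)
        actions.append(action)
        obs += 0.1 * action
    return states, actions


def cycle_of_learning(policy, reward_model, human, cycles=3):
    for _ in range(cycles):
        # 1) Task demonstration: learn directly from human state-action pairs.
        states, actions = human.demonstrate()
        policy.supervised_update(states, actions)

        # 2) Intervention: the agent acts, the human corrects catastrophic
        #    actions, and the corrected trajectory is fed back as supervision.
        states, actions = rollout(policy, human=human)
        policy.supervised_update(states, actions)

        # 3) Evaluation: fit a reward model to human scores of visited states.
        states, _ = rollout(policy, human=human)
        reward_model.fit(states, [human.evaluate(s) for s in states])

        # 4) Reinforcement learning on the learned reward, human out of the loop.
        states, actions = rollout(policy)
        policy.rl_update(states, actions, reward_model)


if __name__ == "__main__":
    cycle_of_learning(Policy(), RewardModel(), ScriptedHuman())
```

The design point the abstract emphasizes corresponds to step 4: after the policy has been warm-started from demonstrations and interventions, learning continues with reinforcement learning on the reward model fit from human evaluations rather than on a hand-engineered reward, which is how the cycle transitions from human-provided supervision to autonomous improvement.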
