Paper Title

Imitation Learning by Estimating Expertise of Demonstrators

Paper Authors

Mark Beliaev, Andy Shih, Stefano Ermon, Dorsa Sadigh, Ramtin Pedarsani

Paper Abstract

Many existing imitation learning datasets are collected from multiple demonstrators, each with different expertise at different parts of the environment. Yet, standard imitation learning algorithms typically treat all demonstrators as homogeneous, regardless of their expertise, absorbing the weaknesses of any suboptimal demonstrators. In this work, we show that unsupervised learning over demonstrator expertise can lead to a consistent boost in the performance of imitation learning algorithms. We develop and optimize a joint model over a learned policy and expertise levels of the demonstrators. This enables our model to learn from the optimal behavior and filter out the suboptimal behavior of each demonstrator. Our model learns a single policy that can outperform even the best demonstrator, and can be used to estimate the expertise of any demonstrator at any state. We illustrate our findings on real-robotic continuous control tasks from Robomimic and discrete environments such as MiniGrid and chess, outperforming competing methods in $21$ out of $23$ settings, with an average of $7\%$ and up to $60\%$ improvement in terms of the final reward.
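The abstract's central idea is a jointly optimized model over a shared policy and per-demonstrator, per-state expertise. Below is a minimal sketch of one plausible instantiation, assuming a discrete action space and modeling each demonstrator as mixing the shared policy with uniform random actions according to their expertise. The class name `JointPolicyExpertiseModel`, the sigmoid inner-product expertise form, and all shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointPolicyExpertiseModel(nn.Module):
    """Hypothetical sketch: a shared policy plus per-demonstrator,
    per-state expertise weights, trained by joint maximum likelihood."""

    def __init__(self, state_dim, n_actions, n_demonstrators, hidden=64):
        super().__init__()
        # Shared policy network, fit on all demonstrations.
        self.policy = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )
        # State embedding and per-demonstrator weight vectors; their inner
        # product, through a sigmoid, gives expertise rho_i(s) in [0, 1].
        self.embed = nn.Linear(state_dim, hidden)
        self.demo_weights = nn.Parameter(torch.zeros(n_demonstrators, hidden))
        self.n_actions = n_actions

    def expertise(self, states, demo_ids):
        # rho_i(s): how closely demonstrator i follows the shared policy at s.
        return torch.sigmoid(
            (self.embed(states) * self.demo_weights[demo_ids]).sum(-1)
        )

    def demo_log_likelihood(self, states, actions, demo_ids):
        # Assumed demonstrator model: with weight rho the shared policy,
        # with weight (1 - rho) uniform random actions (the noise model).
        pi = F.softmax(self.policy(states), dim=-1)
        rho = self.expertise(states, demo_ids).unsqueeze(-1)
        mixed = rho * pi + (1.0 - rho) / self.n_actions
        return torch.log(mixed.gather(-1, actions.unsqueeze(-1)).squeeze(-1))

# Usage: maximize the likelihood over (state, action, demonstrator-id)
# triples; at test time, act with the shared `policy` alone.
model = JointPolicyExpertiseModel(state_dim=8, n_actions=4, n_demonstrators=5)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
states = torch.randn(32, 8)
actions = torch.randint(0, 4, (32,))
demo_ids = torch.randint(0, 5, (32,))
loss = -model.demo_log_likelihood(states, actions, demo_ids).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

Under this kind of mixture, gradient descent attributes consistent, policy-like actions to high expertise and noisy actions to low expertise, so the shared policy is fit mostly on the reliable parts of each demonstration, which is the filtering behavior the abstract describes.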
