Paper Title
Learning When to Switch: Composing Controllers to Traverse a Sequence of Terrain Artifacts
Paper Authors
Paper Abstract
Legged robots often use separate control policies that are highly engineered for traversing difficult terrain such as stairs, gaps, and steps, where switching between policies is only possible when the robot is in a region that is common to adjacent controllers. Deep Reinforcement Learning (DRL) is a promising alternative to hand-crafted control design, though it typically requires the full set of test conditions to be known before training. DRL policies can result in complex (often unrealistic) behaviours that have few or no overlapping regions between adjacent policies, making it difficult to switch behaviours. In this work we develop multiple DRL policies with Curriculum Learning (CL), each of which can traverse a single respective terrain condition, while ensuring an overlap between policies. We then train a network for each destination policy that estimates the likelihood of successfully switching from any other policy. We evaluate our switching method on a previously unseen combination of terrain artifacts and show that it performs better than heuristic methods. While our method is trained on individual terrain types, it performs comparably to a Deep Q Network trained on the full set of terrain conditions. This approach allows the development of separate policies in constrained conditions with embedded prior knowledge about each behaviour, is scalable to any number of behaviours, and prepares DRL methods for applications in the real world.
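Below is a minimal sketch of the switching rule the abstract describes: one estimator per destination policy predicts the probability that switching to it from the current state will succeed, and the robot switches only when that probability is high. All names, dimensions, and the threshold (`switch_estimator`, `choose_policy`, `STATE_DIM`, `SWITCH_THRESHOLD`) are illustrative assumptions, not the paper's actual code; the estimator is a stand-in linear model so the example runs end to end.

```python
# Hypothetical sketch of "learning when to switch" between terrain specialists.
# Each destination policy k has an estimator predicting the probability of a
# successful switch to k from the robot's current state.
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM = 12          # assumed robot state size (joint angles, base pose, ...)
NUM_POLICIES = 3        # e.g. flat-ground, gap, and stairs specialists
SWITCH_THRESHOLD = 0.8  # only switch when predicted success is high

# One random weight vector per destination stands in for the trained
# estimator networks so the sketch is self-contained.
W = rng.standard_normal((NUM_POLICIES, STATE_DIM))


def switch_estimator(destination: int, state: np.ndarray) -> float:
    """Estimated probability that switching to `destination` succeeds."""
    return float(1.0 / (1.0 + np.exp(-W[destination] @ state)))


def choose_policy(active: int, desired: int, state: np.ndarray) -> int:
    """Switch to the desired terrain specialist only when its estimator
    says the switch from the current state is likely to succeed."""
    if desired == active:
        return active
    if switch_estimator(desired, state) >= SWITCH_THRESHOLD:
        return desired
    return active  # otherwise keep the current policy until a safer state


state = rng.standard_normal(STATE_DIM)
print(choose_policy(active=0, desired=2, state=state))
```

In this reading, the per-destination estimators replace hand-specified overlap regions: instead of checking whether the robot sits in a region common to two controllers, the learned network scores the switch directly from the state.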