Paper Title
Multi-Modal Learning of Keypoint Predictive Models for Visual Object Manipulation
Paper Authors
Paper Abstract
Humans have impressive generalization capabilities when it comes to manipulating objects and tools in completely novel environments. These capabilities are, at least partially, a result of humans having internal models of their bodies and any grasped object. How to learn such body schemas for robots remains an open problem. In this work, we develop a self-supervised approach that can extend a robot's kinematic model when grasping an object, using visual latent representations. Our framework comprises two components: (1) we present a multi-modal keypoint detector: an autoencoder architecture trained by fusing proprioception and vision to predict visual keypoints on an object; (2) we show how we can use our learned keypoint detector to learn an extension of the kinematic chain by regressing virtual joints from the predicted visual keypoints. Our evaluation shows that our approach learns to consistently predict visual keypoints on objects in the manipulator's hand, and thus can easily facilitate learning an extended kinematic chain that includes the object grasped in various configurations, from a few seconds of visual data. Finally, we show that this extended kinematic chain lends itself to object manipulation tasks such as placing a grasped object, and we present experiments in simulation and on hardware.
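To make the two components of the abstract concrete, below is a minimal sketch (not the authors' released code) of a multi-modal keypoint detector: an autoencoder whose encoder fuses camera images with proprioception (joint angles) and whose bottleneck is a set of predicted 2D keypoint coordinates, trained self-supervised through image reconstruction. The layer sizes, keypoint count, image resolution, and the simple fully-connected fusion head are illustrative assumptions; the paper's exact architecture and losses may differ.

```python
# Hedged sketch of a multi-modal keypoint autoencoder (PyTorch).
import torch
import torch.nn as nn


class MultiModalKeypointDetector(nn.Module):
    def __init__(self, num_keypoints=4, num_joints=7, img_size=64):
        super().__init__()
        self.num_keypoints = num_keypoints
        # Vision encoder: RGB image -> feature map.
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        feat_dim = 64 * (img_size // 8) ** 2
        # Proprioception encoder: joint angles -> embedding.
        self.proprio_encoder = nn.Sequential(
            nn.Linear(num_joints, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(),
        )
        # Fusion head: predicts normalized (x, y) keypoint coordinates.
        self.keypoint_head = nn.Sequential(
            nn.Linear(feat_dim + 64, 128), nn.ReLU(),
            nn.Linear(128, 2 * num_keypoints), nn.Tanh(),
        )
        # Decoder: reconstructs the image from the keypoint bottleneck,
        # providing the self-supervised training signal.
        self.decoder = nn.Sequential(
            nn.Linear(2 * num_keypoints, feat_dim), nn.ReLU(),
            nn.Unflatten(1, (64, img_size // 8, img_size // 8)),
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, image, joint_angles):
        v = self.vision_encoder(image).flatten(1)
        p = self.proprio_encoder(joint_angles)
        keypoints = self.keypoint_head(torch.cat([v, p], dim=1))
        keypoints = keypoints.view(-1, self.num_keypoints, 2)
        reconstruction = self.decoder(keypoints.flatten(1))
        return keypoints, reconstruction


# One self-supervised training step on random placeholder data.
model = MultiModalKeypointDetector()
image = torch.rand(8, 3, 64, 64)   # batch of camera images
joints = torch.rand(8, 7)          # batch of joint angles
keypoints, recon = model(image, joints)
loss = nn.functional.mse_loss(recon, image)
loss.backward()
```

The second component, extending the kinematic chain, can be illustrated by regressing a fixed "virtual joint" offset that maps the end-effector frame onto the detected object keypoints. The sketch below fits only a translational offset by gradient descent on synthetic data; the end-effector poses and 3D keypoint observations are placeholder assumptions, and the paper's exact virtual-joint parameterization may differ.

```python
# Hedged sketch: fit a virtual-joint offset from end-effector poses to keypoints.
import torch

N = 16
ee_rot = torch.eye(3).expand(N, 3, 3)               # placeholder end-effector rotations
ee_pos = torch.rand(N, 3)                            # placeholder end-effector positions
kp_world = ee_pos + torch.tensor([0.0, 0.0, 0.1])    # synthetic 3D keypoint observations

offset = torch.zeros(3, requires_grad=True)          # virtual-joint translation in the hand frame
optimizer = torch.optim.Adam([offset], lr=1e-2)

for _ in range(500):
    # Predicted keypoint position = end-effector position + rotated offset.
    pred = ee_pos + torch.einsum('nij,j->ni', ee_rot, offset)
    loss = ((pred - kp_world) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(offset)  # converges toward the synthetic offset [0.0, 0.0, 0.1]
```

Because the second fit involves only a handful of parameters, it needs only a few seconds of paired visual and proprioceptive data, which is consistent with the data efficiency claimed in the abstract.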