Paper Title
SGL: Symbolic Goal Learning in a Hybrid, Modular Framework for Human Instruction Following
Paper Authors
Paper Abstract
This paper investigates robot manipulation based on human instruction with ambiguous requests. The intent is to compensate for imperfect natural language via visual observations. Early symbolic methods, based on manually defined symbols, built modular frameworks consisting of semantic parsing and task planning to produce sequences of actions from natural language requests. Modern connectionist methods employ deep neural networks to automatically learn visual and linguistic features and map them to a sequence of low-level actions in an end-to-end fashion. These two approaches are blended to create a hybrid, modular framework: it formulates instruction following as symbolic goal learning via deep neural networks, followed by task planning via symbolic planners. The connectionist and symbolic modules are bridged with the Planning Domain Definition Language (PDDL). The vision-and-language learning network predicts a symbolic goal representation, which is sent to a planner that produces a task-completing action sequence. To improve the flexibility of natural language input, we further incorporate implicit human intents alongside explicit human instructions. To learn generic features for vision and language, we propose to separately pretrain the vision and language encoders on scene graph parsing and semantic textual similarity tasks. Benchmarking evaluates the impact of different components of, and options for, the vision-and-language learning model and shows the effectiveness of the pretraining strategies. Manipulation experiments conducted in the AI2THOR simulator show the robustness of the framework to novel scenarios.
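To make the bridging step concrete, below is a minimal sketch (not the authors' code) of how a predicted symbolic goal could connect a neural model to a PDDL planner, as the abstract describes. The domain name, predicate names, and the ambiguous-request example are hypothetical illustrations.

```python
# Minimal sketch of the symbolic bridge between a vision-and-language
# model and a task planner, as described in the abstract. All names here
# (the "kitchen" domain, the predicates, the request) are hypothetical.

def goal_to_pddl_problem(goal_predicates, objects):
    """Wrap predicted goal predicates in a PDDL problem definition."""
    objs = " ".join(objects)
    goal = " ".join(f"({p})" for p in goal_predicates)
    return (
        "(define (problem serve-request)\n"
        "  (:domain kitchen)\n"  # hypothetical domain name
        f"  (:objects {objs})\n"
        f"  (:goal (and {goal})))\n"
    )

# Example: suppose the vision-and-language model resolves an ambiguous
# request such as "I'm thirsty" (an implicit intent) into symbolic goal
# predicates grounded in the observed scene.
problem = goal_to_pddl_problem(
    goal_predicates=["filled cup water", "on cup table"],
    objects=["cup", "water", "table"],
)
print(problem)
# This problem file, together with a domain file, would be handed to an
# off-the-shelf PDDL planner, which returns the task-completing
# low-level action sequence for the robot to execute.
```

The design point this illustrates is the one the abstract makes: the neural network only has to predict a compact goal state, while the symbolic planner handles the combinatorics of sequencing actions, so neither module needs to learn the other's job end-to-end.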