论文标题

jarvis:一种神经符号常识性推理框架

JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational Embodied Agents

论文作者

Zheng, Kaizhi, Zhou, Kaiwen, Gu, Jing, Fan, Yue, Wang, Jialu, Di, Zonglin, He, Xuehai, Wang, Xin Eric

论文摘要

建立一个对话性体现的代理执行现实生活任务一直是一个长期而又具有挑战性的研究目标,因为它需要有效的人类代理沟通,多模式的理解,远距离的顺序决策做出等。传统的符号方法具有缩放和泛化问题,而端到端的深度学习模型却遭受了数据稀缺和高任务复杂性,并且很难解释。为了从两全其美的世界中受益,我们提出了Jarvis,这是一个神经符号常识性推理框架,用于模块化,可推广和可解释的对话体现的代理。首先,它通过提示大型语言模型(LLM)来获得符号表示,以了解语言理解和次目标计划,并通过从视觉观察中构建语义图。然后,基于任务和动作级别的常识,次目标计划和行动生成的符号模块。 Extensive experiments on the TEACh dataset validate the efficacy and efficiency of our JARVIS framework, which achieves state-of-the-art (SOTA) results on all three dialog-based embodied tasks, including Execution from Dialog History (EDH), Trajectory from Dialog (TfD), and Two-Agent Task Completion (TATC) (e.g., our method boosts the unseen Success Rate on EDH from 6.1\% to 15.8 \%)。此外,我们系统地分析了影响任务绩效的基本因素,并在几次射击设置中证明了我们方法的优越性。我们的Jarvis模型在Alexa奖Simbot公共基准挑战赛中排名第一。

Building a conversational embodied agent to execute real-life tasks has been a long-standing yet quite challenging research goal, as it requires effective human-agent communication, multi-modal understanding, long-range sequential decision making, etc. Traditional symbolic methods have scaling and generalization issues, while end-to-end deep learning models suffer from data scarcity and high task complexity, and are often hard to explain. To benefit from both worlds, we propose JARVIS, a neuro-symbolic commonsense reasoning framework for modular, generalizable, and interpretable conversational embodied agents. First, it acquires symbolic representations by prompting large language models (LLMs) for language understanding and sub-goal planning, and by constructing semantic maps from visual observations. Then the symbolic module reasons for sub-goal planning and action generation based on task- and action-level common sense. Extensive experiments on the TEACh dataset validate the efficacy and efficiency of our JARVIS framework, which achieves state-of-the-art (SOTA) results on all three dialog-based embodied tasks, including Execution from Dialog History (EDH), Trajectory from Dialog (TfD), and Two-Agent Task Completion (TATC) (e.g., our method boosts the unseen Success Rate on EDH from 6.1\% to 15.8\%). Moreover, we systematically analyze the essential factors that affect the task performance and also demonstrate the superiority of our method in few-shot settings. Our JARVIS model ranks first in the Alexa Prize SimBot Public Benchmark Challenge.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源