Proactive Interaction Framework for Intelligent Social Receptionist Robots

Paper Title

Proactive Interaction Framework for Intelligent Social Receptionist Robots

Authors

Xue, Yang; Wang, Fan; Tian, Hao; Zhao, Min; Li, Jiangyong; Pan, Haiqing; Dong, Yueqiang

Abstract

Proactive human-robot interaction (HRI) allows receptionist robots to actively greet people and offer services based on vision, which has been found to improve acceptability and customer satisfaction. Existing approaches are based either on multi-stage decision processes or on end-to-end decision models. However, the rule-based approaches require painstaking expert effort and handle only a minimal set of pre-defined scenarios. Existing end-to-end models, on the other hand, are limited to very general greetings or a few behavior patterns (typically fewer than 10). To address these challenges, we propose a new end-to-end framework, the TransFormer with Visual Tokens for Human-Robot Interaction (TFVT-HRI). The framework first extracts visual tokens of relevant objects from an RGB camera. To ensure a correct interpretation of the scenario, a transformer decision model then processes the visual tokens, augmented with temporal and spatial information. It predicts the appropriate action to take in each scenario and identifies the right target. Our data are collected from an in-service receptionist robot in an office building and then annotated by experts for appropriate proactive behavior. The action set includes 1000+ diverse patterns combining language, emoji expression, and body motions. To validate the framework, we compare our model with other SOTA end-to-end models on both offline test sets and online user experiments in realistic office building environments. The decision model achieves SOTA performance in action triggering and selection, resulting in more humanness and intelligence than the previous reactive reception policies.
