Paper Title

MIDAS: Deep learning human action intention prediction from natural eye movement patterns

Paper Authors

Festor, Paul, Shafti, Ali, Harston, Alex, Li, Michey, Orlov, Pavel, Faisal, A. Aldo

Paper Abstract

Eye movements have long been studied as a window into the attentional mechanisms of the human brain and made accessible as a novel style of human-machine interface. However, not everything that we gaze upon is something we want to interact with; this is known as the Midas Touch problem for gaze interfaces. To overcome the Midas Touch problem, present interfaces tend not to rely on natural gaze cues, but rather use dwell time or gaze gestures. Here we present an entirely data-driven approach to decode human intention for object manipulation tasks based solely on natural gaze cues. We run data collection experiments where 16 participants are given manipulation and inspection tasks to be performed on various objects on a table in front of them. The subjects' eye movements are recorded using wearable eye-trackers, allowing the participants to freely move their head and gaze upon the scene. We use our Semantic Fovea, a convolutional neural network model, to obtain the objects in the scene and their relation to gaze traces at every frame. We then evaluate the data and examine several ways to model the classification task for intention prediction. Our evaluation shows that intention prediction is not a naive result of the data, but rather relies on non-linear temporal processing of gaze cues. We model the task as a time series classification problem and design a bidirectional Long Short-Term Memory (LSTM) network architecture to decode intentions. Our results show that we can decode human intention of motion purely from natural gaze cues and object relative position, with $91.9\%$ accuracy. Our work demonstrates the feasibility of natural gaze as a Zero-UI interface for human-machine interaction, i.e., users only need to act naturally and do not need to interact with the interface itself or deviate from their natural eye movement patterns.
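The abstract frames intention decoding as time-series classification with a bidirectional LSTM over per-frame gaze and object features. As a rough illustration of that kind of model (not the authors' released code), below is a minimal PyTorch sketch; the class name, feature dimensions, sequence length, and two-class setup are all hypothetical assumptions made for the example.

```python
import torch
import torch.nn as nn


class BiLSTMIntentClassifier(nn.Module):
    """Sketch: bidirectional LSTM that maps a gaze-feature time series
    to an action-intention class (e.g. manipulate vs. inspect)."""

    def __init__(self, n_features: int, n_classes: int,
                 hidden_size: int = 64, num_layers: int = 1):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=n_features,   # per-frame gaze/object-relative features
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
        )
        # Forward and backward final hidden states are concatenated.
        self.classifier = nn.Linear(2 * hidden_size, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_features)
        _, (h_n, _) = self.lstm(x)
        # h_n: (num_layers * 2, batch, hidden_size); take the last layer's
        # forward (-2) and backward (-1) hidden states.
        h = torch.cat([h_n[-2], h_n[-1]], dim=1)
        return self.classifier(h)    # raw logits per intention class


if __name__ == "__main__":
    # Toy usage: 8 sequences of 120 frames with 10 gaze features each.
    model = BiLSTMIntentClassifier(n_features=10, n_classes=2)
    logits = model(torch.randn(8, 120, 10))
    print(logits.shape)  # torch.Size([8, 2])
```

Reading the sequence in both directions lets the classifier use gaze context before and after each frame, which matches the abstract's point that intention prediction relies on non-linear temporal processing of gaze cues rather than any single fixation.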
