Paper Title

In-Place Gestures Classification via Long-term Memory Augmented Network

Paper Authors

Lizhi Zhao, Xuequan Lu, Qianyue Bao, Meili Wang

Paper Abstract

In-place gesture-based virtual locomotion techniques enable users to control their viewpoint and intuitively move in the 3D virtual environment. A key research problem is to accurately and quickly recognize in-place gestures, since they can trigger specific movements of virtual viewpoints and enhance user experience. However, to achieve real-time experience, only short-term sensor sequence data (up to about 300ms, 6 to 10 frames) can be taken as input, which limits the classification performance due to the restricted spatio-temporal information. In this paper, we propose a novel long-term memory augmented network for in-place gestures classification. It takes as input both short-term gesture sequence samples and their corresponding long-term sequence samples that provide extra relevant spatio-temporal information in the training phase. We store long-term sequence features with an external memory queue. In addition, we design a memory augmented loss to help cluster features of the same class and push apart features from different classes, thus enabling our memory queue to memorize more relevant long-term sequence features. In the inference phase, we input only short-term sequence samples to recall the stored features accordingly, and fuse them together to predict the gesture class. We create a large-scale in-place gestures dataset from 25 participants with 11 gestures. Our method achieves a promising accuracy of 95.1% with a latency of 192ms, and an accuracy of 97.3% with a latency of 312ms, and is demonstrated to be superior to recent in-place gesture classification techniques. A user study also validates our approach. Our source code and dataset will be made available to the community.
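The abstract describes the method only at a high level. As a reading aid, below is a minimal sketch of the core ideas it mentions: an external memory queue that stores long-term sequence features, similarity-based recall and fusion at inference time when only the short-term sample is available, and a contrastive-style memory augmented loss that pulls together features of the same class and pushes apart different classes. The encoder architectures, feature dimensions, queue size, recall rule, and exact loss form are not specified in the abstract, so every name and number below is an illustrative assumption rather than the authors' implementation.

```python
# Hedged sketch of a long-term memory augmented classifier.
# All modules, dimensions, and the cosine-similarity recall are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryAugmentedClassifier(nn.Module):
    def __init__(self, short_in=6 * 30, long_in=30 * 30, feat_dim=128,
                 num_classes=11, queue_size=1024):
        super().__init__()
        # Stand-in encoders for short-term and long-term sequences
        # (the paper presumably uses temporal networks; MLPs keep this runnable).
        self.short_encoder = nn.Sequential(nn.Linear(short_in, 256), nn.ReLU(),
                                           nn.Linear(256, feat_dim))
        self.long_encoder = nn.Sequential(nn.Linear(long_in, 256), nn.ReLU(),
                                          nn.Linear(256, feat_dim))
        self.classifier = nn.Linear(feat_dim * 2, num_classes)
        # External FIFO memory queue of long-term features and their labels.
        self.register_buffer("queue", F.normalize(torch.randn(queue_size, feat_dim), dim=1))
        self.register_buffer("queue_labels", torch.zeros(queue_size, dtype=torch.long))
        self.register_buffer("queue_ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def enqueue(self, long_feat, labels):
        """Overwrite the oldest queue entries with the newest long-term features."""
        n = long_feat.shape[0]
        idx = (torch.arange(n, device=long_feat.device) + int(self.queue_ptr)) % self.queue.shape[0]
        self.queue[idx] = F.normalize(long_feat, dim=1)
        self.queue_labels[idx] = labels
        self.queue_ptr[0] = (int(self.queue_ptr) + n) % self.queue.shape[0]

    def recall(self, short_feat, k=8):
        """Recall the k memorized long-term features most similar (cosine)
        to each short-term feature and average them."""
        sim = F.normalize(short_feat, dim=1) @ self.queue.t()   # (B, Q)
        topk = sim.topk(k, dim=1).indices                       # (B, k)
        return self.queue[topk].mean(dim=1)                     # (B, D)

    def forward(self, short_seq, long_seq=None, labels=None):
        short_feat = self.short_encoder(short_seq.flatten(1))
        if self.training and long_seq is not None:
            # Training: the paired long-term sample supplies extra
            # spatio-temporal context and populates the memory queue.
            long_feat = self.long_encoder(long_seq.flatten(1))
            self.enqueue(long_feat.detach(), labels)
            fused = torch.cat([short_feat, long_feat], dim=1)
        else:
            # Inference: only the short-term sample is given; recall the
            # stored long-term features and fuse them with it.
            fused = torch.cat([short_feat, self.recall(short_feat)], dim=1)
        return self.classifier(fused), short_feat


def memory_augmented_loss(short_feat, queue, queue_labels, labels, temperature=0.1):
    """Contrastive-style term: pull short-term features toward memorized
    long-term features of the same class, push them away from other classes."""
    sim = F.normalize(short_feat, dim=1) @ queue.t() / temperature      # (B, Q)
    pos = (labels.unsqueeze(1) == queue_labels.unsqueeze(0)).float()    # (B, Q)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()
```

Under these assumptions, a training step would combine cross-entropy on the fused prediction with the memory augmented term, e.g. `logits, feat = model(short, long, labels)` followed by `F.cross_entropy(logits, labels) + memory_augmented_loss(feat, model.queue, model.queue_labels, labels)`, while inference simply calls `model(short)`. The actual fusion scheme and loss weighting should be taken from the authors' released code once available.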
