Paper Title
Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition
Paper Authors
Paper Abstract
Wearable cameras are becoming increasingly popular in several applications, raising the interest of the research community in developing approaches for recognizing actions from the first-person point of view. An open challenge in egocentric action recognition is that videos lack detailed information about the main actor's pose and, when focusing on manipulation tasks, tend to record only parts of the movement. The amount of information about the action itself is therefore limited, making the understanding of the manipulated objects and their context crucial. Many previous works have addressed this issue with two-stream architectures, where one stream is dedicated to modeling the appearance of the objects involved in the action, and the other to extracting motion features from optical flow. In this paper, we argue that learning features jointly from these two information channels is beneficial for better capturing the spatio-temporal correlations between them. To this end, we propose a single-stream architecture able to do so, thanks to the addition of a self-supervised block that uses a pretext motion-prediction task to intertwine motion and appearance knowledge. Experiments on several publicly available databases show the power of our approach.
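To make the core idea concrete, below is a minimal PyTorch-style sketch (not the authors' code) of a single-stream network whose shared appearance features feed both an action classifier and a self-supervised motion-prediction head, so that the pretext loss pushes motion cues into the same backbone. All names here (SingleStreamWithMotionPretext, motion_head, the 0.5 loss weight, the placeholder motion target) are illustrative assumptions, not the paper's actual architecture or API.

```python
# Sketch of joint encoding via a self-supervised motion-prediction
# pretext task on a single RGB stream. Illustrative only.
import torch
import torch.nn as nn

class SingleStreamWithMotionPretext(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int = 256):
        super().__init__()
        # Shared appearance backbone over RGB frames (stand-in for a
        # standard CNN trunk such as a ResNet).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # Supervised head: action classification from pooled features.
        self.classifier = nn.Linear(feat_dim, num_classes)
        # Self-supervised head: predict a coarse motion map (e.g. a
        # downsampled flow magnitude) from the same shared features.
        self.motion_head = nn.Conv2d(feat_dim, 1, kernel_size=1)

    def forward(self, rgb: torch.Tensor):
        feats = self.backbone(rgb)                        # (B, C, H', W')
        logits = self.classifier(feats.mean(dim=(2, 3)))  # action scores
        motion_pred = self.motion_head(feats)             # pretext output
        return logits, motion_pred

# Joint training step: classification loss plus a weighted pretext loss,
# so motion and appearance knowledge are learned by one stream.
model = SingleStreamWithMotionPretext(num_classes=10)
rgb = torch.randn(2, 3, 64, 64)
target_motion = torch.randn(2, 1, 16, 16)  # placeholder motion target
labels = torch.randint(0, 10, (2,))
logits, motion_pred = model(rgb)
loss = nn.functional.cross_entropy(logits, labels) \
     + 0.5 * nn.functional.mse_loss(motion_pred, target_motion)
loss.backward()
```

At inference time only the classification head would be needed; the motion head exists purely to shape the shared representation during training, which is what distinguishes this design from a two-stream model with late fusion.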