Paper Title
Deep Analysis of CNN-based Spatio-temporal Representations for Action Recognition
Paper Authors
Paper Abstract
In recent years, a number of approaches based on 2D or 3D convolutional neural networks (CNNs) have emerged for video action recognition, achieving state-of-the-art results on several large-scale benchmark datasets. In this paper, we carry out an in-depth comparative analysis to better understand the differences between these approaches and the progress they have made. To this end, we develop a unified framework for both 2D-CNN and 3D-CNN action models, which enables us to remove bells and whistles and provides a common ground for fair comparison. We then conduct a large-scale analysis involving over 300 action recognition models. Our comprehensive analysis reveals that a) a significant leap has been made in efficiency for action recognition, but not in accuracy; and b) 2D-CNN and 3D-CNN models behave similarly in terms of spatio-temporal representation ability and transferability. Our code is available at https://github.com/IBM/action-recognition-pytorch.
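To make the 2D-CNN vs. 3D-CNN distinction concrete, the following is a minimal sketch (not the paper's actual code, which lives in the linked repository) of the output-shape arithmetic for the two convolution types on a video clip of shape (T, C, H, W). A 2D convolution processes each of the T frames independently, so temporal structure must be modeled elsewhere; a 3D convolution slides jointly over time and space.

```python
# Hypothetical illustration: shape arithmetic for 2D vs. 3D convolutions
# applied to a video clip. All function names here are assumptions for
# this sketch, not identifiers from the paper's codebase.

def conv2d_out(h, w, k, stride=1, pad=0):
    """Spatial output size (H', W') of one 2D convolution with a k x k kernel."""
    return ((h + 2 * pad - k) // stride + 1,
            (w + 2 * pad - k) // stride + 1)

def conv3d_out(t, h, w, k, stride=1, pad=0):
    """Spatio-temporal output size (T', H', W') of one 3D convolution
    with a cubic k x k x k kernel."""
    return ((t + 2 * pad - k) // stride + 1,) + conv2d_out(h, w, k, stride, pad)

frames = 8  # clip length T

# A 2D-CNN applies the same spatial filter to every frame independently:
# the temporal dimension passes through untouched.
per_frame = conv2d_out(224, 224, 3, pad=1)
print(per_frame)                              # (224, 224), repeated for each of the 8 frames

# A 3D-CNN convolves over time as well, so T is transformed by the
# same kernel/stride/padding arithmetic as H and W.
print(conv3d_out(frames, 224, 224, 3, pad=1))  # (8, 224, 224)
```

This is why a plain 2D-CNN needs an extra temporal-fusion mechanism (e.g. late pooling over frames), whereas a 3D-CNN entangles temporal and spatial filtering in every layer.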