Paper Title
Deep Analysis of CNN-based Spatio-temporal Representations for Action Recognition
Paper Authors
Paper Abstract
In recent years, a number of approaches based on 2D or 3D convolutional neural networks (CNNs) have emerged for video action recognition, achieving state-of-the-art results on several large-scale benchmark datasets. In this paper, we carry out an in-depth comparative analysis to better understand the differences between these approaches and the progress they have made. To this end, we develop a unified framework for both 2D-CNN and 3D-CNN action models, which enables us to remove bells and whistles and provides a common ground for fair comparison. We then conduct a large-scale analysis involving over 300 action recognition models. Our comprehensive analysis reveals that a) a significant leap has been made in efficiency for action recognition, but not in accuracy; and b) 2D-CNN and 3D-CNN models behave similarly in terms of spatio-temporal representation ability and transferability. Our code is available at https://github.com/IBM/action-recognition-pytorch.
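To make the 2D-CNN vs. 3D-CNN distinction concrete, the following is a minimal sketch (not the paper's actual code, which lives in the linked repository) of the output-shape arithmetic for the two convolution types on a video clip of shape (T, C, H, W). A 2D convolution processes each of the T frames independently, so temporal structure must be modeled elsewhere; a 3D convolution slides jointly over time and space.

```python
# Hypothetical illustration: shape arithmetic for 2D vs. 3D convolutions
# applied to a video clip. All function names here are assumptions for
# this sketch, not identifiers from the paper's codebase.

def conv2d_out(h, w, k, stride=1, pad=0):
    """Spatial output size (H', W') of one 2D convolution with a k x k kernel."""
    return ((h + 2 * pad - k) // stride + 1,
            (w + 2 * pad - k) // stride + 1)

def conv3d_out(t, h, w, k, stride=1, pad=0):
    """Spatio-temporal output size (T', H', W') of one 3D convolution
    with a cubic k x k x k kernel."""
    return ((t + 2 * pad - k) // stride + 1,) + conv2d_out(h, w, k, stride, pad)

frames = 8  # clip length T

# A 2D-CNN applies the same spatial filter to every frame independently:
# the temporal dimension passes through untouched.
per_frame = conv2d_out(224, 224, 3, pad=1)
print(per_frame)                              # (224, 224), repeated for each of the 8 frames

# A 3D-CNN convolves over time as well, so T is transformed by the
# same kernel/stride/padding arithmetic as H and W.
print(conv3d_out(frames, 224, 224, 3, pad=1))  # (8, 224, 224)
```

This is why a plain 2D-CNN needs an extra temporal-fusion mechanism (e.g. late pooling over frames), whereas a 3D-CNN entangles temporal and spatial filtering in every layer.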