Paper Title
Action Keypoint Network for Efficient Video Recognition
Paper Authors
Paper Abstract
Reducing redundancy is crucial for improving the efficiency of video recognition models. An effective approach is to select informative content from the holistic video, yielding a popular family of dynamic video recognition methods. However, existing dynamic methods focus on either temporal or spatial selection alone, neglecting the reality that redundancy is usually both spatial and temporal. Moreover, their selected content is usually cropped with fixed shapes, while the realistic distribution of informative content can be much more diverse. With these two insights, this paper proposes to integrate temporal and spatial selection into an Action Keypoint Network (AK-Net). From different frames and positions, AK-Net selects informative points scattered across arbitrary-shaped regions as a set of action keypoints and then transforms video recognition into point cloud classification. AK-Net has two steps, i.e., keypoint selection and point cloud classification. First, it feeds the video into a baseline network and extracts a feature map from an intermediate layer. We view each pixel on this feature map as a spatial-temporal point and select informative keypoints using self-attention. Second, AK-Net devises a ranking criterion to arrange the keypoints into an ordered 1D sequence. Consequently, AK-Net brings two-fold benefits for efficiency: the keypoint selection step collects informative content within arbitrary shapes and increases the efficiency of modeling spatial-temporal dependencies, while the point cloud classification step further reduces the computational cost by compacting the convolutional kernels. Experimental results show that AK-Net consistently improves the efficiency and performance of baseline methods on several video recognition benchmarks.
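The keypoint selection step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_action_keypoints`, the feature-map layout, and the use of a simple feature-norm score as a stand-in for the paper's self-attention scoring are all assumptions for illustration. It shows the core idea of treating every feature-map pixel as a spatial-temporal point, ranking the points by an informativeness score, and keeping the top-k as an ordered 1D sequence ready for point-cloud classification.

```python
import numpy as np

def select_action_keypoints(feat, k):
    """Hypothetical sketch of AK-Net-style keypoint selection.

    feat: (T, H, W, C) intermediate feature map of a video clip.
    Returns the top-k point features (k, C), ordered by descending
    score, plus their (t, h, w) coordinates (k, 3).
    """
    T, H, W, C = feat.shape
    points = feat.reshape(-1, C)              # flatten to T*H*W spatial-temporal points
    scores = np.linalg.norm(points, axis=1)   # stand-in score for the self-attention criterion
    order = np.argsort(-scores)[:k]           # ranking criterion: keep the k most informative points
    coords = np.stack(np.unravel_index(order, (T, H, W)), axis=1)
    return points[order], coords

# Toy example: 4 frames, a 7x7 spatial grid, 16 channels.
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 7, 7, 16))
cloud, coords = select_action_keypoints(feat, k=32)
print(cloud.shape)   # (32, 16): an ordered 1D point sequence
```

The resulting `(k, C)` sequence can be fed to compact 1D convolutions instead of full spatial-temporal kernels, which is where the abstract's second efficiency gain comes from.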