Rethinking Resolution in the Context of Efficient Video Recognition

Paper Title

Rethinking Resolution in the Context of Efficient Video Recognition

Authors

Chuofan Ma, Qiushan Guo, Yi Jiang, Zehuan Yuan, Ping Luo, Xiaojuan Qi

Abstract

In this paper, we empirically study how to make the most of low-resolution frames for efficient video recognition. Existing methods mainly focus on developing compact networks or alleviating temporal redundancy of video inputs to increase efficiency, whereas compressing frame resolution has rarely been considered a promising solution. A major concern is the poor recognition accuracy on low-resolution frames. We thus start by analyzing the underlying causes of performance degradation on low-resolution frames. Our key finding is that the major cause of degradation is not information loss in the down-sampling process, but rather the mismatch between network architecture and input scale. Motivated by the success of knowledge distillation (KD), we propose to bridge the gap between network and input size via cross-resolution KD (ResKD). Our work shows that ResKD is a simple but effective method to boost recognition accuracy on low-resolution frames. Without bells and whistles, ResKD considerably surpasses all competitive methods in terms of efficiency and accuracy on four large-scale benchmark datasets, i.e., ActivityNet, FCVID, Mini-Kinetics, Something-Something V2. In addition, we extensively demonstrate its effectiveness over state-of-the-art architectures, i.e., 3D-CNNs and Video Transformers, and scalability towards super low-resolution frames. The results suggest ResKD can serve as a general inference acceleration method for state-of-the-art video recognition. Our code will be available at https://github.com/CVMI-Lab/ResKD.
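
The abstract describes cross-resolution knowledge distillation only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes a standard temperature-softened KD objective, in which a teacher that sees full-resolution frames supervises a student that sees down-sampled frames. The function names, the average-pooling down-sampler, the weighting `alpha`, and the temperature are all hypothetical choices made for illustration and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def downsample(frame, factor):
    """Average-pool a 2D frame (list of rows) by an integer factor.

    Stand-in for producing the low-resolution student input (assumption).
    """
    height, width = len(frame), len(frame[0])
    pooled = []
    for i in range(0, height - height % factor, factor):
        row = []
        for j in range(0, width - width % factor, factor):
            block = [frame[i + di][j + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def cross_resolution_objective(student_logits, teacher_logits, label,
                               alpha=0.5, temperature=2.0):
    """Weighted sum of cross-entropy on the label and the KD term.

    student_logits: student output on low-resolution frames
    teacher_logits: teacher output on the matching high-resolution frames
    alpha: hypothetical weight on the distillation term
    """
    cross_entropy = -math.log(softmax(student_logits)[label])
    distill = kd_loss(student_logits, teacher_logits, temperature)
    return (1 - alpha) * cross_entropy + alpha * distill
```

In use, `downsample(frame, 2)` would produce the student's input from the same frame the teacher sees at full resolution, and `cross_resolution_objective` would be evaluated on the two models' logits for that clip. When the student's logits equal the teacher's, the distillation term is zero and only the cross-entropy term remains.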
