Paper Title
Representation Recycling for Streaming Video Analysis
Paper Authors
Paper Abstract
We present StreamDEQ, a method that aims to infer frame-wise representations on videos with minimal per-frame computation. In the absence of ad-hoc solutions, conventional deep networks perform feature extraction from scratch at each frame. We instead aim to build streaming recognition models that can natively exploit temporal smoothness between consecutive video frames. We observe that the recently emerging implicit layer models provide a convenient foundation for constructing such models, as they define representations as the fixed points of shallow networks, which must be estimated using iterative methods. Our main insight is to distribute the inference iterations over the temporal axis by using the most recent representation as a starting point at each frame. This scheme effectively recycles recent inference computations and greatly reduces the required processing time. Through extensive experimental analysis, we show that StreamDEQ is able to recover near-optimal representations in a few frames' time and to maintain an up-to-date representation throughout the video. Our experiments on video semantic segmentation, video object detection, and human pose estimation in videos show that StreamDEQ achieves on-par accuracy with the baseline while being more than 2-4x faster.
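To make the warm-start scheme in the abstract concrete, below is a minimal sketch of a DEQ-style implicit layer whose fixed point is estimated by plain iteration, with each frame's solve initialized from the previous frame's representation. This is an illustration, not the authors' implementation: the names (ImplicitLayer, stream_inference), the toy layer architecture, and the iteration budgets (4 warm vs. 32 cold iterations) are all assumptions for the example; the frames are assumed to be already embedded to the representation's channel width.

```python
# A minimal sketch of representation recycling with a toy implicit layer.
# Hypothetical names and architecture; for illustration only.
import torch
import torch.nn as nn

class ImplicitLayer(nn.Module):
    """Toy implicit (DEQ-style) layer: the representation z* is a fixed
    point of z = f([z, x]), estimated here by plain fixed-point iteration."""

    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.GroupNorm(8, channels),
            nn.ReLU(),
        )

    def step(self, z: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # One solver iteration: z <- f([z, x]).
        return self.f(torch.cat([z, x], dim=1))

    def solve(self, x: torch.Tensor, z0: torch.Tensor, n_iters: int) -> torch.Tensor:
        z = z0
        for _ in range(n_iters):
            z = self.step(z, x)
        return z

def stream_inference(frames, layer, warm_iters=4, cold_iters=32):
    """Frame t starts its fixed-point solve from frame t-1's representation,
    so it needs far fewer iterations than a from-scratch (cold) solve."""
    z = None
    for x in frames:
        if z is None:
            # First frame: no previous representation, cold start from zeros.
            z = layer.solve(x, torch.zeros_like(x), cold_iters)
        else:
            # Later frames: recycle the previous frame's representation.
            z = layer.solve(x, z, warm_iters)
        yield z

# Usage on dummy pre-embedded frames (batch 1, 32 channels, 64x64):
layer = ImplicitLayer(channels=32)
frames = [torch.randn(1, 32, 64, 64) for _ in range(8)]
with torch.no_grad():
    for z in stream_inference(frames, layer):
        pass  # z is the per-frame representation fed to the task head
```

In this sketch only the first frame pays the full solver cost; every later frame runs a small, fixed iteration budget on top of the recycled representation, which is the mechanism behind the per-frame speedups the abstract reports. The paper's actual solver and network are more sophisticated than this toy example.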