A Deeper Dive Into What Deep Spatiotemporal Networks Encode: Quantifying Static vs. Dynamic Information

Paper Title

A Deeper Dive Into What Deep Spatiotemporal Networks Encode: Quantifying Static vs. Dynamic Information

Authors

Matthew Kowal, Mennatullah Siam, Md Amirul Islam, Neil D. B. Bruce, Richard P. Wildes, Konstantinos G. Derpanis

Abstract

Deep spatiotemporal models are used in a variety of computer vision tasks, such as action recognition and video object segmentation. Currently, there is a limited understanding of what information is captured by these models in their intermediate representations. For example, while it has been observed that action recognition algorithms are heavily influenced by visual appearance in single static frames, there is no quantitative methodology for evaluating such static bias in the latent representation compared to bias toward dynamic information (e.g. motion). We tackle this challenge by proposing a novel approach for quantifying the static and dynamic biases of any spatiotemporal model. To show the efficacy of our approach, we analyse two widely studied tasks, action recognition and video object segmentation. Our key findings are threefold: (i) Most examined spatiotemporal models are biased toward static information; although, certain two-stream architectures with cross-connections show a better balance between the static and dynamic information captured. (ii) Some datasets that are commonly assumed to be biased toward dynamics are actually biased toward static information. (iii) Individual units (channels) in an architecture can be biased toward static, dynamic or a combination of the two.

Paper Title